Our Obsession with Continuous Testing
In a previous blog post, we introduced you to Zach Deane-Mayer, a data scientist who runs our core modeling team. One of the most important tools in his team’s arsenal is a data science performance evaluation system created and maintained by our QA team. This system is at the core of a comprehensive testing philosophy that we believe is crucial to delivering a platform our customers can trust, no matter which DataRobot features they’re using or how they’ve chosen to deploy them. We know from talking with colleagues at other data science companies that our approach to product testing sets us apart from companies that are just getting started with automated machine learning.
Testing begins on day one of feature development
The groundwork for testing begins the moment one of our development teams starts work on a new feature. Their first assignment is to collect a large pool of datasets representing every use case they can think of, with a variety of encodings, file sizes, and so on. For example, when we started work on our time series capability, its development team gathered hundreds of datasets. That pool was eventually whittled down to a subset containing US datasets with ASCII characters, Japanese datasets with Kanji characters, extremely small files, extremely large files, files with lots of missing data – you name it. We also invite customers to contribute datasets so we can verify that DataRobot works well on their data. And if we can’t find a dataset that simulates a particular use case or dataset type, we create one.
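Creating synthetic edge-case datasets like those described above might look something like this in Python (the file names and the specific variants are hypothetical illustrations, not DataRobot’s actual test fixtures):

```python
# Hypothetical sketch of synthesizing edge-case test datasets:
# a tiny file, a Kanji-header file, and a file full of missing values.
import csv
import os
import random
import tempfile

def write_csv(path, rows, header, encoding="utf-8"):
    """Write a small CSV test fixture with the given header and rows."""
    with open(path, "w", newline="", encoding=encoding) as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

tmp = tempfile.mkdtemp()

# Extremely small file: a single data row.
write_csv(os.path.join(tmp, "tiny.csv"), [[1, 0.5]], ["id", "target"])

# Japanese dataset with Kanji column names, encoded as UTF-8.
write_csv(os.path.join(tmp, "kanji.csv"), [[1, "東京"]], ["識別子", "都市"])

# Lots of missing data: roughly half the cells are blank.
random.seed(0)
rows = [[i, "" if random.random() < 0.5 else i * 2] for i in range(100)]
write_csv(os.path.join(tmp, "missing.csv"), rows, ["id", "value"])
```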
We’ve been automating machine learning for over six years so, at this point, we have hundreds of unique datasets that cover every major use case and feature that DataRobot supports. As we find out about new customer use cases, we add a dataset to simulate them as well.
As development continues, project teams use data science performance evaluations to test (and re-test) their work against these datasets, catching potential bugs early in the process. Case in point: we currently have a team working on a major new feature that won’t be released for another nine to twelve months, but they are already using our evaluation system to test it thoroughly.
You hear software developers talk about test-driven development. This is test-driven data science. You get the data first and you make sure that what you build works with the data. It’s the only way to find (and fix) all of the edge cases our customers are likely to encounter in the real world.
We take release testing VERY seriously
We do about four major Enterprise releases and several minor releases each year (last year we had 13 Enterprise releases). Once code is merged from the various project teams, we run a complete battery of tests using hundreds of datasets to make sure all models work with every configuration we support. Next, we compare results from the new release against equivalent results from all previous releases to identify any combination of data, model type, and deployment that shows a regression in any key metric. We also compare results across deployments. For example, we may find that high-frequency time series models using Japanese data perform 10% slower in one type of server configuration. Simulating every possible combination allows us to proactively find and correct any problem or abnormality.
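A release-over-release comparison like the one described above can be sketched as follows (the data structures, keys, and threshold here are illustrative assumptions, not DataRobot’s actual tooling):

```python
# Hypothetical sketch of flagging regressions between two releases.
# Each run maps a (dataset, model_type, deployment) key to metrics
# where lower is better (e.g. runtime in seconds, log loss).

REGRESSION_THRESHOLD = 0.05  # flag anything >5% worse than the prior release

def flag_regressions(previous, current, threshold=REGRESSION_THRESHOLD):
    """Return (key, metric, old, new) tuples whose relative change
    from the previous release exceeds the threshold."""
    regressions = []
    for key, new_metrics in current.items():
        old_metrics = previous.get(key)
        if old_metrics is None:
            continue  # new combination, nothing to compare against
        for metric, new_value in new_metrics.items():
            old_value = old_metrics.get(metric)
            if old_value and (new_value - old_value) / old_value > threshold:
                regressions.append((key, metric, old_value, new_value))
    return regressions

# Example: the 10% time series slowdown mentioned above gets flagged.
prev = {("jp_timeseries_hf", "ts_model", "server_a"): {"runtime_s": 100.0}}
curr = {("jp_timeseries_hf", "ts_model", "server_a"): {"runtime_s": 110.0}}
print(flag_regressions(prev, curr))
```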
Testing this way also lets us determine exactly how metrics have improved over the previous release and make decisions about potential tradeoffs. For example, testing may show that accuracy for an entire feature or a specific algorithm has improved significantly, but that the additional computation required causes it to run 10% slower. Zach and his team decide whether that is an acceptable tradeoff. Of course, any tradeoffs we make are always documented in the release notes.
Zach is an experienced data scientist. Once he has reviewed all benchmarks in all possible configurations and found all tradeoffs to be acceptable, he signs off on the release. Then, Xavier Conort – a former #1 ranked Kaggle data scientist (one of several on our staff) – reviews the same results so that we always have at least two senior data scientists signing off on a release.
Testing doesn’t end when the release ships
Shipping a release is a major accomplishment, but testing doesn’t end there. Our QA team, led by Meghan Elledge, continuously updates the data science performance evaluation system so that it always includes the latest datasets and covers new use cases. Why? Because they also do nightly testing!
Our library of hundreds of datasets is organized into ten categories (time series, regression, classification, etc.), and every night we run a few tests from each category. Over the course of a week, we complete several of these categories, and by the end of the month every dataset has been tested – hundreds of tests in all. Then the process starts over again.
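The monthly rotation described above could be sketched roughly like this (the category names, dataset IDs, and round-robin scheme are all illustrative assumptions, not DataRobot’s actual scheduler):

```python
# Hypothetical sketch of a nightly test rotation: take a few datasets
# from each category every night, cycling so every dataset is covered
# before the rotation repeats.
from itertools import islice

CATEGORIES = {
    "time_series": ["ts_01", "ts_02", "ts_03"],
    "regression": ["reg_01", "reg_02"],
    "classification": ["cls_01", "cls_02", "cls_03", "cls_04"],
}

def nightly_batches(categories, per_night=2):
    """Yield one batch per night, round-robining within each category."""
    cursors = {name: 0 for name in categories}
    while True:
        batch = []
        for name, datasets in categories.items():
            start = cursors[name]
            for i in range(per_night):
                batch.append(datasets[(start + i) % len(datasets)])
            cursors[name] = (start + per_night) % len(datasets)
        yield batch

# Two nights of this toy schedule cover every dataset at least once.
nights = list(islice(nightly_batches(CATEGORIES), 2))
```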
As you can see, we are obsessed with continuous testing. It’s part of our corporate culture and every development team starts thinking about it on day one of their projects. That way, when it’s time to test the overall release and perform the subsequent nightly tests, we’re confident that every possible combination of use case and deployment scenario has been meticulously tested and evaluated. As far as we’re concerned, this is what’s required to be an enterprise-grade automated machine learning platform – especially one that Fortune 500 companies rely on to deliver critical predictions.