How to Automate Machine Learning
When we launched DataRobot six years ago, people laughed. “Automate machine learning?” they hooted. “Ha ha! It will never work!”
Experts sniffed. “You will never automate machine learning. You will get a weak model. You will get a biased model that will wreck your business.”
Six years, hundreds of customers, thousands of users, and almost a billion models later, they aren’t laughing anymore.
Today, quite a few vendors claim to offer automated machine learning. Big firms and small firms; legacy firms and startups. We’re thrilled. Imitation is the sincerest form of flattery.
However, we’ve noticed something striking about these DataRobot imitators.
They’re doing it wrong.
There is a right way to automate machine learning, and there are many wrong ways. We created a guide to the ten critical components of successful automation.
But first, here are the most common wrong ways to automate machine learning.
Some people seem confused about what it means to automate something.
One vendor, for example, offers a drag-and-drop tool for analysis and calls it “automation.” Drag-and-drop UIs are nice. It’s a lot easier to drag and drop than it is to write good Python code. But users still have to know what to drag and where to drop it. That takes knowledge. Drag-and-drop tools won’t help you “democratize” machine learning if that’s your goal.
Another company claims that batch scoring jobs running under a scheduler “automate” machine learning. Schedulers are handy. They automate routine production jobs. That’s great. But they don’t automate the hard parts of machine learning. Someone still has to train and validate a model.
That’s the hard part.
DataRobot automates the hard things, like model training and validation.
If you want to fool people, tell them you have a special algorithm. It does everything, you claim, so there’s no need to use those other algorithms that most data scientists use. Show customers a white paper that explains why your algorithm is special.
To pull this off, you may need one or two university professors to help explain your algorithm’s specialness.
There are two problems with this approach.
First, there is no such thing as an algorithm that outperforms all others on all problems. Machine learning developers deal in tradeoffs. You build an algorithm that works well on some problems, at the expense of good performance on others.
The second problem is transparency. If you use an algorithm outside the mainstream, few people will understand how to use it. Your customers will have a hard time finding and hiring people to work with your tools.
Of course, if your goal is to lock in customers, that’s a feature, not a bug.
The One-Trick Pony
Some vendors build an automated machine learning engine on a single algorithm. They claim that one algorithm is all you need. You just need to engineer features and tune the model properly.
This is nonsense. One algorithm can outperform others on one use case, but it won’t outperform others on all use cases. For consistent quality across diverse use cases, you have to try many different algorithms.
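To make “try many different algorithms” concrete, here is a minimal sketch in plain Python. It scores three toy candidates with k-fold cross-validation and keeps the best one. Everything in it is an illustrative assumption: the tiny XOR-style dataset, the three stand-in models, and helper names such as `cross_val_accuracy`. It is not how DataRobot or any other product works internally.

```python
from statistics import mean

# Toy XOR-style data (made up for illustration): class 0 sits in two
# opposite corners, class 1 in the other two.
X = [(0.0, 0.0), (0.1, 0.1), (0.1, 0.0),   # corner A, class 0
     (1.0, 1.0), (0.9, 0.9), (1.0, 0.9),   # corner B, class 0
     (0.0, 1.0), (0.1, 0.9), (0.0, 0.9),   # corner C, class 1
     (1.0, 0.0), (0.9, 0.1), (0.9, 0.0)]   # corner D, class 1
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Each candidate takes training data and returns a predict(x) function.
def fit_majority(X_train, y_train):
    majority = max(sorted(set(y_train)), key=y_train.count)
    return lambda x: majority

def fit_threshold(X_train, y_train):
    # Split on the first feature, halfway between the two class means.
    m0 = mean(row[0] for row, label in zip(X_train, y_train) if label == 0)
    m1 = mean(row[0] for row, label in zip(X_train, y_train) if label == 1)
    cut = (m0 + m1) / 2
    high_label = 1 if m1 >= m0 else 0
    return lambda x: high_label if x[0] > cut else 1 - high_label

def fit_nearest_neighbor(X_train, y_train):
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in X_train]
        return y_train[dists.index(min(dists))]
    return predict

def cross_val_accuracy(fit, X, y, k=3):
    """Mean accuracy over k folds; row i goes to fold i % k."""
    fold_scores = []
    for fold in range(k):
        test_idx = [i for i in range(len(X)) if i % k == fold]
        train_idx = [i for i in range(len(X)) if i % k != fold]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(X[i]) == y[i] for i in test_idx)
        fold_scores.append(hits / len(test_idx))
    return mean(fold_scores)

candidates = {
    "majority": fit_majority,
    "threshold": fit_threshold,
    "nearest_neighbor": fit_nearest_neighbor,
}
scores = {name: cross_val_accuracy(fit, X, y) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

On this dataset the nearest-neighbor candidate wins because no single threshold can separate the corners. On data that splits cleanly along one feature, the simpler threshold model would do just as well, which is the point: you only find out by comparing.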
Several vendors use nothing but deep learning. Deep learning is cool. For problems with a very large number of features, like image recognition, it’s often the best technique to use. DataRobot uses deep learning, together with many other techniques.
Why do vendors trust in a single algorithm? Sometimes, it’s blind faith. In the machine learning community, some people prefer to specialize in one technique, such as deep learning.
It’s also convenient. Delivering software is easier if you use just one algorithm. Machine learning is messy. If you can convince customers that one algorithm is all you need, you can save on software engineering, testing, and product development. You don’t have to build tools that automatically compare algorithms, because there’s nothing to compare!
It’s easy to build an automatic transmission for a car that only has one gear.
There are also companies that keep everything very, very secret. Their techniques are so special, they dare not reveal them. Conveniently, they can’t disclose their reference customers, either.
To pull this off, it helps if you can convince customers that your founders previously worked for the CIA, KGB, or Mossad.
Here’s a reality check. Like most technologies, machine learning advances in small incremental steps. Great leaps forward are rare. And in the past fifty years, all of the big advances happened out in the open — not in closed workshops.
Don’t let vendors tell you their secret algorithm is a lot better than the others. They’re feeding you a line.
The Bag of Parts
There are vendors that tell you their platform is great for automated machine learning. When you probe for details, however, you discover that they mean you can build your own engine on their platform. All you have to do is write some code, add some open source automation software, and voilà, you’re off to the races.
There is a problem with this value proposition.
You need to save time and leverage talent. That is why automation appeals to you. You want to deliver more data products with the same people or spend more time on high-value projects.
But if talent is scarce, where are you going to find the time to build and maintain complex software?
Vendors that encourage you to build your own automated machine learning engine aren’t selling you a refrigerator. They’re selling you a bag of refrigerator parts.
Old Wine in New Bottles
Modern machine learning software distributes workload over many servers. That way, you can perform many tasks in parallel, and scale out to handle very large problems.
But distributed machine learning is only 8 to 10 years old. Tools that are older than that run on one server only. If your computing problem exceeds the capacity of one server, well, that’s just too bad.
Legacy software vendors figure they can build an automated machine learning engine that runs on top of their existing software. That approach rarely works well. Automated machine learning engines run a lot of experiments. You have to run those experiments in parallel, or you will wait a long time to see results.
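Here is a rough sketch of why parallelism matters when an engine runs many experiments. The `run_experiment` function, its made-up score, and the `configs` list are hypothetical stand-ins for training and validating one model. The sketch uses threads on one machine because the stand-in work is just a short sleep; real training is CPU-bound and would more likely be spread across processes or servers.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_experiment(config):
    """Hypothetical stand-in for training and validating one model."""
    time.sleep(0.05)                        # pretend the work takes a while
    score = config["depth"] * 0.1           # made-up score, not a real metric
    return config["name"], score

# Eight hypothetical experiment configurations.
configs = [{"name": f"model_{d}", "depth": d} for d in range(1, 9)]

# One experiment at a time.
start = time.perf_counter()
sequential = [run_experiment(c) for c in configs]
sequential_seconds = time.perf_counter() - start

# All experiments at once; pool.map returns results in input order.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(configs)) as pool:
    parallel = list(pool.map(run_experiment, configs))
parallel_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
```

Both runs produce the same results. The sequential run waits for every experiment in turn, so its wall-clock time grows with the number of experiments, while the parallel run takes roughly as long as the slowest single experiment.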
Check the documentation. When software vendors build on a legacy platform, they usually warn you that the automated engine takes a long time to run. One company even tells you to schedule automated runs for nights and weekends.
That’s just sad.
Half a Bridge
Many machine learning vendors automate one or two parts of the workflow and leave the rest for you to do manually. That’s like selling you half a bridge. What are you going to do with half a bridge? Drive halfway across the river, then swim the rest of the way?
Automated machine learning is powerful because it helps you bring new users into the process. With built-in quality assurance, you can trust novice users to build reliable models. Your most valuable experts can take on coaching and advisory roles, or they can work on the most challenging models.
You can’t do that with partially automated tools. If any part of the process is manual, your expert users must perform every task. Otherwise, there is too much risk that novices will make mistakes on the manual parts. As a result, the capacity of your expert data scientists limits your entire machine learning program.
For machine learning at the speed of today’s business, rethink the workflow — and automate all of it.
How to Automate Machine Learning
Yes, there is a right way to automate machine learning:
Automate the hard things.
Use mainstream algorithms.
Use diverse algorithms.
Build an engine that works out of the box.
Build for high performance and scalability.
Automate the complete machine learning workflow.
We published a guide to automated machine learning on our website. It covers the ten critical components of an automated machine learning platform. There’s also an interactive guide to the machine learning workflow that shows how we automate each step.
Because when you’re investing in an automated machine learning platform, you want to make sure it’s right.