Deploying Machine Learning Models at Scale
This article was originally published on Algorithmia’s website. The company was acquired by DataRobot in 2021. This article may not be entirely up to date and may refer to products and offerings that no longer exist. Find out more about DataRobot MLOps here.
Deploying machine learning models at scale is one of the most pressing challenges faced by the community of data scientists today, and as ML models get more complex, it’s only getting harder. The most common way machine learning gets deployed today is on PowerPoint slides.
We estimate that fewer than 5 percent of commercial data science projects make it to production. If you want to be part of that share, you need to understand how deployment works, why machine learning is a unique deployment problem, and how to navigate this messy ecosystem.
What is Machine Learning Deployment? The Meaning of Scale
To understand model deployment, you need to understand the difference between writing software and writing software for scale. If you want to write a program that just works for you, it’s pretty easy: you can write code on your computer and run it whenever you want. But if you want that software to work for other people across the globe? Well, that’s a bit harder.
Software done at scale means that your program or application works for many people, in many locations, and at a reasonable speed. This is very different from writing software locally: it’s like the difference between having a garage sale and running a multinational e-commerce store. The logistics of running the latter are a whole other game.
As technology has become more global and the world more connected, developers are required to create these “scalable” applications more and more often. This has led to the dawn of an entirely new field called DevOps (short for development operations), focused on actually scaling these kinds of applications. DevOps is still nebulously defined, but broadly speaking it is the practice of developers collaborating to create scalable, resilient, and distributed systems. In plain terms, it’s deploying applications that just work for everyone who wants to use them.
Scaling applications and creating these distributed systems is really, really tough. There are entire books, courses, and even graduate degrees on the topic, and it’s only getting more complex with the rise of new paradigms like microservices and containerization. Some frameworks like the Scale Cube (below) try to make this field digestible, but it’s still pretty challenging.
The Unique Challenges of Machine Learning Deployment
Deploying regular software applications is hard, but when that software is a machine learning pipeline, it’s worse! Machine learning has a few unique features that make deploying it at scale harder:
Multiple Data Science Languages
Normal software applications are usually written in one main programming language that’s purpose-built for production systems, like Java or C++. That’s not the case for machine learning. Models are often built in multiple languages that don’t always interact well with each other. For example, it’s not uncommon to have a machine learning pipeline (a combination of data ingestion, cleaning, and modeling) that starts in R, continues in Python, and ends in Scala. Making those all work together is not trivial.
Data Science Languages Can Be Slow
Python and R are the most popular languages for data science and machine learning applications, but complete production models are rarely deployed in those languages for speed reasons. Porting a Python or R model into a production language like C++ or Java is challenging, and often results in reduced performance (speed, prediction accuracy, etc.) compared with the original, trained model.
Machine Learning Can Be Extremely Compute Heavy, and Relies on GPUs
Neural nets can often be very deep (the popular VGGnet model is 16-19 layers deep!), which means that training and using them for inference takes up a lot of compute power. When you want your algorithms to run fast, for millions of people, that can be quite the hindrance.
Additionally, most production machine learning today relies on GPUs. They’ve been shown to be far faster, more practical, and more efficient for both training and inference (if you have the budget). But GPUs are scarce and expensive, which adds another layer of complexity to the task of scaling a machine learning project.
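To put a rough number on "compute heavy," here is a minimal sketch that counts the parameters in a 16-layer VGG network. The layer layout below is the commonly cited VGG-16 configuration with a 224x224 RGB input; that layout is an assumption on our part, since the text above only mentions the depth.

```python
# Assumed VGG-16 layout (not spelled out in the article).
# Numbers are output channels of 3x3 conv layers; "M" marks a 2x2 max-pool.
VGG16_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]
VGG16_FC = [4096, 4096, 1000]  # widths of the fully connected layers


def count_vgg16_parameters(in_channels=3, input_size=224):
    """Count weights and biases for the assumed VGG-16 layout."""
    total = 0
    channels, size = in_channels, input_size
    for layer in VGG16_CONV:
        if layer == "M":
            size //= 2  # pooling halves height and width
        else:
            total += 3 * 3 * channels * layer + layer  # weights + biases
            channels = layer
    features = channels * size * size  # flattened input to the first FC layer
    for width in VGG16_FC:
        total += features * width + width
        features = width
    return total


params = count_vgg16_parameters()
print(f"{params:,} parameters, about {params * 4 / 1e6:.0f} MB as float32")
```

Under those assumptions the count comes out to roughly 138 million parameters, all of which have to be loaded and multiplied through for every prediction.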
Machine Learning Compute Works In Spikes
The last (but not least!) quirk of predictive machine learning deployment is that usage is erratic. Once your algorithms are trained, they’re not used consistently: your customers will only call them when they need them. That can mean that you’re only supporting 100 API calls at 8:00 AM, but 1 million at 8:30 AM. Scaling up and down like that while making sure not to pay for servers you don’t need is a nightmare.
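A back-of-the-envelope sketch shows how wide that gap is. It treats the two figures above as calls per minute and assumes each server instance can handle 500 calls per minute; both the per-minute reading and the capacity figure are hypothetical, chosen only to illustrate the arithmetic.

```python
import math

# Hypothetical capacity figure, not from the article: one instance is
# assumed to handle 500 calls per minute.
CALLS_PER_INSTANCE_PER_MINUTE = 500


def instances_needed(calls_per_minute, capacity=CALLS_PER_INSTANCE_PER_MINUTE):
    """Return the smallest instance count that covers the load (at least 1)."""
    return max(1, math.ceil(calls_per_minute / capacity))


quiet = instances_needed(100)        # the 8:00 AM load, read as calls/minute
peak = instances_needed(1_000_000)   # the 8:30 AM load, read as calls/minute
print(f"quiet: {quiet} instance(s), peak: {peak} instance(s)")
```

Provisioning for the peak around the clock would leave almost all of those instances idle most of the day, while provisioning for the quiet period would drop nearly every request during the spike.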
All in all, application deployment at scale is already extremely difficult, and the nuances that machine learning adds make the picture even cloudier. It’s no wonder that so few data science projects end up actually making it into production systems.
If you’re on the data science team at a large enterprise, this can be even more frustrating. After taking months to write out your (awesome) models, you’re going to need to hand them over to engineering to deploy at scale. That process can take months, and the models you end up with may not at all resemble what you handed over originally. And if you want to make small changes afterward, or continually improve your models with new data? Forget about it.
What’s An Engineer To Do? Deployment Approaches
The unfortunate answer to all of these looming questions is that there’s no silver bullet. Deploying machine learning models is and will continue to be difficult, and that’s just a reality that organizations are going to need to deal with. Thankfully though, a few new architectures and products are giving developers more hoses to tame this fire.
If You’re A Tech Giant, Don’t Worry About It!
Perhaps the best indicator that machine learning deployment is indeed a pretty hard thing is how bigger tech companies have responded; more than a few have developed their own internal systems for accomplishing this task. Uber’s platform is called Michelangelo (link below), and it allows employees to develop machine learning products easily. Facebook and Google have also developed these kinds of tools, and it’s not hard to imagine that other tech giants have done and will do the same.
Serverless, Containerization, and Microservices
The serverless movement is about letting developers worry less about provisioning and managing servers, and it’s a good fit for machine learning deployment. Because of the spike-y nature of machine learning inference, you want to be able to scale your servers instantly, but then shut them down when they’re not being used. Serverless platforms like AWS Lambda make that a possibility.
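As a rough illustration, here is what a prediction function might look like in the (event, context) handler shape that AWS Lambda uses for Python. The "model" is a hypothetical stand-in (three fixed weights), and the request format, a JSON body with a "features" list, is an assumption for this sketch rather than anything prescribed by the platform.

```python
import json

# Set up once at import time, so repeated calls in a warm environment reuse it.
# Hypothetical stand-in for a trained model: fixed linear weights.
MODEL_WEIGHTS = [0.5, -0.25, 2.0]
MODEL_BIAS = 0.1


def predict(features):
    """Score one feature vector with the stand-in linear model."""
    return sum(w * x for w, x in zip(MODEL_WEIGHTS, features)) + MODEL_BIAS


def _response(status, payload):
    return {"statusCode": status, "body": json.dumps(payload)}


def lambda_handler(event, context):
    """Handle one request; assumes the JSON payload arrives in event['body']."""
    try:
        payload = json.loads(event.get("body") or "{}")
        features = [float(x) for x in payload["features"]]
    except (KeyError, TypeError, ValueError):
        return _response(400, {"error": "expected a JSON body with a 'features' list"})
    if len(features) != len(MODEL_WEIGHTS):
        return _response(400, {"error": f"expected {len(MODEL_WEIGHTS)} features"})
    return _response(200, {"prediction": predict(features)})


if __name__ == "__main__":
    # Local smoke test; on a serverless platform the runtime calls the handler.
    print(lambda_handler({"body": json.dumps({"features": [1, 2, 3]})}, None))
```

The handler holds no per-request state, which is what lets a platform run as many copies as the current traffic needs and none when there is no traffic.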
Potentially more popular than serverless, the trend of containerization is also making it slightly easier to deploy machine learning models at scale. Instead of using one virtual machine per deployment, containers let you deploy multiple individually packaged instances of your software on the same machine. Containers are typically easier to share across development teams, and they lend themselves well to reproducible results on both the training and inference side.
The microservices architecture is also helping make ML deployment more accessible. Remember how different parts of a machine learning pipeline can be in different languages that don’t work well together? Packaging each part of your pipeline as an individual microservice (accessed through a RESTful API) helps break down that barrier. Your application pieces then interact through API calls, which is more manageable than wiring the languages together directly.
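Here is a minimal sketch of that idea using only the Python standard library: one pipeline step (a hypothetical data-cleaning function) is wrapped in a small HTTP service, and a caller reaches it with a plain JSON POST. The /clean path, the payload shape, and the cleaning rule are all assumptions made up for this example; the caller could just as well be written in R or Scala, since all it needs is HTTP and JSON.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def clean(record):
    """Hypothetical cleaning step: replace missing values with 0.0."""
    return [0.0 if value is None else float(value) for value in record]


class CleaningHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/clean":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        record = json.loads(self.rfile.read(length))["record"]
        body = json.dumps({"record": clean(record)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def call_service(url, payload):
    """POST a JSON payload to another pipeline step and decode its reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Skip any system proxy, since this demo only talks to localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read())


def demo():
    """Start the service on a free local port, call it once, then stop it."""
    server = HTTPServer(("127.0.0.1", 0), CleaningHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/clean"
        return call_service(url, {"record": [1, None, 2.5]})
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

A real deployment would add input validation, error handling, and a production-grade server, but the contract between the steps stays the same: JSON in, JSON out, over HTTP.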