When it Comes to MLOps, Asking “Build vs Buy” Is the Wrong Question
Companies hiring expensive data scientists and machine learning (ML) engineers expect them to focus on moving the business forward by building AI/ML capabilities and applications that are very specific to their operations, industry, and customers.
But, frankly, the failure rate of enterprise investments in AI is staggering: analysis by VentureBeat suggests that only 21% of companies have AI “deployed across the business.” That sounds bad, and it is. But, on the flip side, the 13% that do succeed are seeing a big impact.
Clearly, then, increasing the number of businesses using AI/ML is a problem worth focusing on. And that’s where ML operations management, or MLOps, platforms come in.
Bogged Down in ML Operations Management
It’s interesting how many businesses get bogged down in ML operations management. The idea of an MLOps platform is to manage the lifecycle of machine learning models – specifically deploying, maintaining, and updating them – in much the same way as application lifecycle management methods and systems are used in software development. The goal is to provide the guardrails and processes necessary to deploy, manage, operate, and secure ML workloads at scale.
A big part of the reason behind enterprise AI’s relatively anemic success rate is that too many businesses still spend too much time and energy building – or attempting to build – their own ML operations management platforms.
ML operations management platforms are essential to getting models into production and keeping them there. A model or pipeline that is not in production provides little or no value to the business. But while these platforms deliver major operational benefits, they are not value-add from a business point of view. Technologists love building. I am one of those. But usually our mission is to use technology to move the needle for the business.
Data scientists have the most impact exploring strategies, building models, feature engineering, and working with hyperparameters. Machine learning engineers add value by making sure the deployment and scale-out of that work is as quick and reliable as possible.
Is having this talent work on building a platform for any of these steps the value-add? Is having a proprietary MLOps platform a direct competitive advantage? For most businesses, the answer is emphatically “no.”
The Machine Learning Model Lifecycle is Complicated
There’s no other way to say it: building an ML platform is a very heavy lift. Even the biggest and best-resourced teams would hesitate to take it on. It’s also an iterative process that requires version after version of improvements.
Building effective models is a challenge in itself, but to get any business value at all from them, they must be run with a production application. Making this happen requires supporting and tracking several requirements and processes at the deployment, operations, and governance and security stages of the ML model lifecycle.
At the deployment level, you must consider how to manage model versioning to track changes and enable rollbacks if needed. You also need to be able to publish models to a model catalog so that other people can find and reuse them, which is especially valuable when models have already been trained. Models don’t live in vacuums – sometimes they consist of ensembles of multiple classifiers, in other cases they are pipelines with pre- and post-processing functions – so a pipelining function is also needed to track dependencies and control development and deployment. There’s more: the source code needs managing, APIs need to be provisioned to enhance adoption and scalability, and data source connections need to be established and maintained.
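To make the versioning, catalog, and rollback requirements concrete, here is a minimal sketch of what a model registry has to keep track of. Every name here is hypothetical and illustrative – this is not any particular product’s API, just the shape of the bookkeeping involved:

```python
from dataclasses import dataclass, field

@dataclass
class ModelVersion:
    version: int
    artifact_uri: str   # where the serialized model artifact lives
    metrics: dict       # e.g. {"auc": 0.92}, captured at training time

@dataclass
class RegistryEntry:
    name: str
    versions: list = field(default_factory=list)
    history: list = field(default_factory=list)  # promotion history

class ModelRegistry:
    """Toy in-memory catalog: register versions, promote, roll back."""

    def __init__(self):
        self._entries = {}

    def register(self, name, artifact_uri, metrics):
        entry = self._entries.setdefault(name, RegistryEntry(name))
        version = len(entry.versions) + 1
        entry.versions.append(ModelVersion(version, artifact_uri, metrics))
        return version

    def promote(self, name, version):
        # Record which version is serving production traffic.
        self._entries[name].history.append(version)

    def production(self, name):
        history = self._entries[name].history
        return history[-1] if history else None

    def rollback(self, name):
        history = self._entries[name].history
        if len(history) > 1:
            history.pop()  # fall back to the previously promoted version
        return self.production(name)
```

Even this toy ignores persistence, concurrency, access control, and artifact storage – each of which a real platform must solve. That gap is the point.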
Now imagine doing that for tons of frameworks and libraries, languages, processing dependencies, and data connections. Your buildout matrix has just become quite big. And that’s only the first layer of MLOps.
Next comes the operations side. Here we need to think about integrating with the CI/CD pipeline to accelerate production and leverage automated testing, deployment, and updates. Application integration is also an important consideration, whether that’s for visualizations or simply other infrastructure applications. Performance monitoring and alerting are required for every model. And where are you running this model – in the cloud, on-prem, multi-cloud… all of the above? Are you going to be able to build something that’s consistently transferable to each environment?
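The per-model monitoring and alerting piece alone is non-trivial. A bare-bones sketch of the idea – compare a rolling window of a quality metric against the baseline measured at deployment time – might look like this (the class name, metric, and thresholds are all invented for illustration):

```python
class ModelMonitor:
    """Tracks a rolling window of one model quality metric
    and flags the model when it degrades past a tolerance."""

    def __init__(self, baseline, tolerance=0.05, window=100):
        self.baseline = baseline    # metric value at deployment time
        self.tolerance = tolerance  # allowed absolute degradation
        self.window = window        # how many recent observations to keep
        self.observations = []

    def record(self, value):
        self.observations.append(value)
        self.observations = self.observations[-self.window:]

    def should_alert(self):
        if not self.observations:
            return False
        current = sum(self.observations) / len(self.observations)
        return (self.baseline - current) > self.tolerance
```

Now multiply that by every model in production, add drift detection on the input data, wire the alerts into your paging system, and keep it all running across cloud and on-prem – that is the operations layer a platform has to provide.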
Then you get down to the real nitty-gritty of production: governance and security. Can you guarantee data security, network security, and model security? Are you correctly applying your organization’s InfoSec policy to your ML workloads? Are you doing all the right things from a permissions perspective? If you work in a regulated industry, can you package up all of the audit trails for compliance? Can you at any point determine the 4 W’s of Who called, What, When, and Why?
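Capturing those 4 W’s means wrapping every model invocation in an audit record. A hypothetical sketch, with all names and the log destination made up for illustration:

```python
import functools
import time

AUDIT_LOG = []  # stand-in for a durable, tamper-evident audit store

def audited(action, why="unspecified"):
    """Record the 4 W's -- who, what, when, why -- around a model call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, user="unknown", **kwargs):
            AUDIT_LOG.append({
                "who": user,
                "what": action,
                "when": time.time(),
                "why": why,
            })
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@audited("predict", why="credit scoring")
def predict(features):
    return sum(features) > 1.0  # stand-in for a real model

predict([0.6, 0.7], user="alice")  # AUDIT_LOG now holds one entry
```

A real system would also have to authenticate the caller rather than trust a parameter, retain the log immutably, and make it queryable for auditors – again, infrastructure work with no direct business differentiation.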
This is what you need to think about for a true production system. All of this is a lot of work – and that’s just to get the first version up and running. Then – wait for it – what happens when you inevitably need to upgrade something?
The Right Question Is “Build and Maintain or Buy?”
Hopefully you now have a sense that there is a lot more to an MLOps platform than the initial build.
I’ve met multiple companies that have successfully built out stunningly complex internal ML platforms. But the one constant we see is that the investment gets bigger and bigger every year. And that’s fine, as long as you know from the outset that building a solution such as this is a journey. It’s not a decision that you make at one point in time.
Many organizations start with a recipe or blueprint approach for productionizing models. That tends to break down when you have hundreds of models to control, observe, and track. It leads to a lot of “let’s go build X,” which is a distraction from value-add work and, as an operating model, is at serious risk of being blown up by changes in resourcing priorities.
Furthermore, large organizations will have dozens of teams, each using dozens of tools. An MLOps platform that can run all these workloads in a consistent way has to evolve continuously – and that makes it expensive.
At the risk of someone muttering “Yeah, you would say that,” 9.9 times out of 10 you will get the best value from investing in a platform that allows you to accelerate and standardize production. Not only will that platform perform better, sooner, than an internally built solution, but it will also free up your very expensive data science and ML engineering teams to build new, value-adding use cases instead.
Time to Start Treating ML Development Like Software Development
One of the stock objections to buying an ML operations management platform is the perceived lack of flexibility.
In some areas of the business this kind of argument stacks up, but not when it comes to ML lifecycle management. Take a look at the world of software development and at your current stack there. For the vast majority of organizations, buying solutions and components to manage software production has not left them gripped by inflexibility – if anything, there has been great success in loosely coupled but tightly integrated bought components.
It’s time to think about the ML development lifecycle like the software development lifecycle: adopt best-in-class components for every single piece of the value chain. As I noted at the start, the success rate of enterprise ML is too low for such a high-investment, high-potential-impact technology – this is an urgent problem.
Ultimately, ML is even more iterative than software development. We need to accelerate production to get models out there fast, so the bad can die and the great can prosper. The credibility of the discipline depends on it.