
When it Comes to MLOps, Asking “Build vs Buy” Is the Wrong Question

May 17, 2022
· 6 min read

Companies hiring expensive data scientists and machine learning (ML) engineers expect them to focus on moving the business forward by building AI/ML capabilities and applications that are very specific to their operations, industry, and customers.

But, frankly, the failure rate of enterprise investments in AI is staggering: analysis by VentureBeat suggests that only 21% of companies have AI “deployed across the business.” That sounds bad, and it is. But, on the flip side, the companies that do succeed are seeing a big impact.

Clearly, then, increasing the number of businesses using AI/ML is a problem worth focusing on. And that’s where ML operations management, or MLOps, platforms come in.

Bogged Down in ML Operations Management

It’s interesting how many businesses get bogged down in ML operations management. The idea of an MLOps platform is to manage the lifecycle of machine learning models – specifically deploying, maintaining, and updating – in much the same way as application lifecycle management methods and systems are used in software development. The goal is to provide the guardrails and processes necessary to be able to deploy, manage, operate, and secure ML workloads at scale.

A big part of the reason behind enterprise AI’s relatively anemic success rate is that too many businesses still spend too much time and energy building – or attempting to build – their own ML operations management platforms.

ML operations management platforms are essential to getting models into production and keeping them there. A model or pipeline that is not in production provides limited value, if any, to the business. But while these platforms deliver very real operational benefits, building one in-house is not value-add from a business point of view. Technologists love building. I am one of them. But usually our mission is to use technology to move the needle for the business.

Data scientists have the most impact exploring strategies, building models, feature engineering, and working with hyperparameters. Machine learning engineers add value by making sure the deployment and scale-out of that work is as quick and reliable as possible.

Is having this talent work on building a platform for any of these steps the value-add? Is having a proprietary MLOps platform a direct competitive advantage? For most businesses, the answer is emphatically “no.”

The Machine Learning Model Lifecycle is Complicated

There’s no other way to say it: building an ML platform is a very heavy lift. Even the biggest and best-resourced teams would hesitate to take it on. It’s also an iterative process that requires version after version of improvements.

Building effective models is a challenge in itself, but to get any business value at all from them, they must run within a production application. Making this happen requires supporting and tracking several requirements and processes at the deployment, operations, and governance and security stages of the ML model lifecycle.

At the deployment level, you must consider how to manage model versioning to track changes and enable roll-backs if needed. You also need to be able to publish models to a model catalog so that other people can find and reuse them, which is especially valuable when models have already been trained. Models don’t live in vacuums – sometimes they consist of ensembles of multiple classifiers, in other cases they are pipelines with pre- and post-processing functions – so a pipelining function is also needed to track dependencies and control development and deployment. There’s more: the source code needs managing, APIs need to be provisioned to enhance adoption and scalability, and data source connections need to be established and maintained.
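To make the deployment-level requirements concrete – versioning, rollback, and a catalog other teams can search – here is what the core bookkeeping looks like. This is a purely illustrative in-memory sketch; all names are assumptions, not any particular product’s API:

```python
import time
from dataclasses import dataclass


@dataclass
class ModelVersion:
    """One immutable catalog entry: the artifact plus its metadata."""
    version: int
    artifact: bytes            # serialized model (e.g. a pickled pipeline)
    registered_at: float
    description: str = ""


class ModelRegistry:
    """Minimal in-memory registry (illustrative only): register new
    versions, promote one to production, and roll back when needed."""

    def __init__(self):
        self._models: dict[str, list[ModelVersion]] = {}
        self._production: dict[str, int] = {}  # model name -> live version

    def register(self, name: str, artifact: bytes, description: str = "") -> int:
        """Append a new version to the catalog; versions are never mutated."""
        versions = self._models.setdefault(name, [])
        version = len(versions) + 1
        versions.append(ModelVersion(version, artifact, time.time(), description))
        return version

    def promote(self, name: str, version: int) -> None:
        """Mark a registered version as the one serving production traffic."""
        if version < 1 or version > len(self._models.get(name, [])):
            raise ValueError(f"unknown version {version} for model {name!r}")
        self._production[name] = version

    def rollback(self, name: str) -> int:
        """Re-promote the previous version; returns the now-live version."""
        current = self._production[name]
        if current <= 1:
            raise ValueError("no earlier version to roll back to")
        self._production[name] = current - 1
        return current - 1

    def production_artifact(self, name: str) -> bytes:
        """Fetch the artifact currently serving production."""
        return self._models[name][self._production[name] - 1].artifact
```

Even this toy version hints at why the real thing is a heavy lift: a production registry also needs persistence, access control, lineage to training data and source code, and integration with every serving environment.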

Now imagine doing that for tons of frameworks and libraries, languages, processing dependencies, and data connections. Your buildout matrix has just become quite big. And that’s only the first layer of MLOps.

Next comes the operations side. Here we need to think about integrating with the CI/CD pipeline to accelerate production and leverage automated testing, deployment, and updates. Application integration is also an important consideration, whether that’s for visualizations or simply other infrastructure applications. Performance monitoring and alerting is required for every model. And where are you running this model – in the cloud, on-prem, multi-cloud… all of the above? Are you going to be able to build something that’s consistently transferable to each environment?
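Take just the performance monitoring and alerting piece: at its core it means comparing live metrics against an offline baseline and firing an alert on degradation. A minimal sketch, with illustrative names and thresholds (real platforms track many more signals, such as data drift and latency):

```python
import statistics


def check_model_health(recent_scores, baseline_mean, max_drop=0.05, alert=print):
    """Compare a rolling window of live accuracy scores against the offline
    baseline; invoke the alert callback when the live mean falls more than
    `max_drop` below baseline. Names and thresholds are illustrative.

    Returns True when the model is degraded, False otherwise."""
    live_mean = statistics.mean(recent_scores)
    degraded = baseline_mean - live_mean > max_drop
    if degraded:
        alert(f"ALERT: live accuracy {live_mean:.3f} is more than "
              f"{max_drop:.0%} below baseline {baseline_mean:.3f}")
    return degraded
```

Now multiply this by every model in production, every metric worth watching, and every environment the model runs in, and the scale of the operations layer becomes clear.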

Then you get down to the real nitty-gritty of production: governance and security. Can you guarantee data security, network security, and model security? Are you correctly applying your organization’s InfoSec policy to your ML workloads? Are you doing all the right things from a permissions perspective? If you work in a regulated industry, can you package up all of the audit trails for compliance? Can you, at any point, determine the 4 W’s: Who called, What, When, and Why?
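The 4 W’s can be captured mechanically – for example, by wrapping every model invocation so the caller must declare who they are and why they are calling. A hypothetical sketch, not any real platform’s API:

```python
import functools
import time

# In a real system this would be an append-only, tamper-evident store.
AUDIT_LOG = []


def audited(fn):
    """Record the 4 W's for every call: Who called, What, When, and Why.
    Callers must pass `who` and `why` as keyword arguments; the wrapper
    logs them before delegating. Purely illustrative."""
    @functools.wraps(fn)
    def wrapper(*args, who, why, **kwargs):
        AUDIT_LOG.append({
            "who": who,           # Who called
            "what": fn.__name__,  # What was called
            "when": time.time(),  # When it happened
            "why": why,           # Why (business justification)
        })
        return fn(*args, **kwargs)
    return wrapper


@audited
def score(features):
    # Stand-in for a real model invocation.
    return sum(features) > 1.0
```

A call like `score([0.4, 0.9], who="alice", why="loan application review")` then leaves a compliance-ready trail behind it. The hard part in practice is enforcing this consistently across every team, tool, and environment – which is exactly the platform problem.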

This is what you need to think about for a true production system. All of this is a lot of work – and that’s just to get the first version up and running. Then – wait for it – what happens when you inevitably need to upgrade something?

The right question is “build and maintain or buy?”

Hopefully you have a sense now that there is a lot more to an MLOps platform than the initial build.

I’ve met multiple companies that have successfully built out stunningly complex internal ML platforms. But the one constant we see is that the investment gets bigger and bigger every year. And that’s fine, as long as you know from the outset that building a solution such as this is a journey. It’s not a decision that you make at one point in time.

Many organizations start with a recipe or blueprint approach for productionizing models. That tends to break down when you have hundreds of models to control, observe, and track. It leads to a lot of “let’s go build X,” which is a distraction from value-add work and, as an operating model, is at serious risk of being blown up by changes in resourcing priorities.

Furthermore, large organizations will have dozens of teams each using dozens of tools. An MLOps platform that can run all these workloads in a consistent way has to be continuously evolving – and that makes it expensive to build and maintain.

At risk of someone muttering “Yeah, you would say that,” 9.9 times out of 10 you will get the best value from investing in a platform that allows you to accelerate and standardize production. Not only will that platform perform better, sooner, than an internally built solution, but it will also free up your very expensive data science and ML engineering teams to build new, value-added use cases instead.

Time to Start Treating ML Development Like Software Development

One of the stock objections to buying an ML operations management platform is the perceived lack of flexibility.

In some areas of the business this kind of argument stacks up, but not when it comes to ML lifecycle management. Look at the world of software development and at your current stack there. For the vast majority of organizations, buying solutions and components to manage software production has not left them gripped by inflexibility – if anything, there has been great success with loosely coupled but tightly integrated bought components.

It’s time to think about the ML development lifecycle like the software development lifecycle: adopt best-in-class components for every single piece of the value chain. As I noted at the start, the success rate of enterprise ML is too low for such a high investment and high potential impact technology – this is an urgent problem.

Ultimately, ML is even more iterative than software development. We need to accelerate production to get models out there fast, so the bad can die and the great can prosper. The credibility of the discipline depends on it.
