Enterprise companies find MLOps critical for reliability and performance

Enterprise startups UiPath and Scale have drawn huge attention in recent years from companies looking to automate workflows, from RPA (robotic process automation) to data labeling.

What’s been overlooked amid such workflow-specific tools is the base class of products that enterprises use to build the core of their machine learning (ML) workflows, and the shift in focus toward automating the deployment and governance stages of those workflows.

That’s where MLOps comes in, and its popularity has been fueled by the rise of core ML workflow platforms such as Boston-based DataRobot. The company has raised more than $430 million and reached a $1 billion valuation this past fall serving this very need for enterprise customers. DataRobot’s vision has been simple: enabling a range of users within enterprises, from business and IT users to data scientists, to gather data and build, test and deploy ML models quickly.

Founded in 2012, the company has quietly amassed a customer base that boasts more than a third of the Fortune 50, with triple-digit yearly growth since 2015. DataRobot’s top four industries include finance, retail, healthcare and insurance; its customers have deployed over 1.7 billion models through DataRobot’s platform. The company is not alone, with competitors like H2O.ai, which raised a $72.5 million Series D led by Goldman Sachs last August, offering a similar platform.

Why the excitement? As artificial intelligence pushed into the enterprise, the first step was to go from data to a working ML model, which started with data scientists doing this manually, but today is increasingly automated and has become known as “auto ML.” An auto-ML platform like DataRobot’s can let an enterprise user quickly auto-select features based on their data and auto-generate a number of models to see which ones work best.

As auto ML became more popular, improving the deployment phase of the ML workflow has become critical for reliability and performance — and so enters MLOps. It’s quite similar to the way that DevOps has improved the deployment of source code for applications. Companies such as DataRobot and H2O.ai, along with other startups and the major cloud providers, are intensifying their efforts on providing MLOps solutions for customers.

We sat down with DataRobot’s team to understand how their platform has been helping enterprises build auto-ML workflows, what MLOps is all about and what’s been driving customers to adopt MLOps practices now.

The rise of MLOps

As enterprises adopt auto-ML workflows, one of the issues they’re commonly seeing is that many of the models built by data scientists never make it into production. There are a number of issues that can stop deployment, including models that underperform in pre-production environments, incompatibilities between production environments and the model-training environment, or inconsistencies with production infrastructure.

This is where MLOps comes in.

The world of MLOps has been shaped a fair bit by the evolution of DevOps, which has rocketed to popularity over the past few years. The role of DevOps is to efficiently integrate and deploy source code, and it’s typically managed by a DevOps engineer who works as a bridge between IT and developers.

MLOps is similar, but focuses on the ML model and data sets as opposed to code. These days, data engineers run MLOps, but it’s likely the specialized role of MLOps engineer will come about soon.

There are four components to the modern MLOps workflow:

  • Continuous Integration: In DevOps, this refers to synchronizing new code with the existing code base, whereas in MLOps, this process refers to synchronizing the data and models. This involves checks such as confirming that a model mathematically converges, making sure it does not result in data-type errors, and running tests on sub-methods within the model to ensure they’re working as expected.
  • Continuous Deployment: In DevOps, this refers to moving code into production, and it’s the same with MLOps, except with models instead of code. This involves checks such as ensuring that the libraries required for a model to run exist in the production environment, testing the model with sample input data to verify it’s producing the expected outputs and testing performance metrics in pre-production.
  • Monitoring: Once a model has been deployed, it needs to be actively evaluated to ensure that it’s working as desired, both in terms of accuracy and runtime speed. MLOps solutions look at metrics such as data drift (assessing whether a model is losing its accuracy as input data changes) and performance around run time and latency.
  • Governance: An enterprise is likely to have many models in production at once, and when one stops working as expected, a data scientist needs to trace the cause. An end-to-end system that tracks, for each model, which data it was trained on, who built it and when, and other such factors can make that investigation much easier. Maintaining this data is also helpful for compliance purposes.
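The integration, deployment and monitoring checks above can be sketched in a few lines. The sketch below is purely illustrative, assuming a hypothetical fraud-scoring model and invented thresholds; it shows a pre-deployment smoke test and a simple mean-shift drift check of the kind an MLOps system might run, not any vendor’s actual implementation.

```python
import statistics

def smoke_test(model, sample_inputs, expected_range):
    """Pre-deployment check: the model must return numeric scores
    within an expected range for known sample inputs."""
    lo, hi = expected_range
    for x in sample_inputs:
        score = model(x)
        assert isinstance(score, float), f"unexpected output type for {x}"
        assert lo <= score <= hi, f"score {score} outside [{lo}, {hi}]"
    return True

def drift_check(training_values, live_values, max_shift=0.5):
    """Monitoring check: flag drift if the live feature mean shifts by
    more than `max_shift` training standard deviations."""
    mu = statistics.mean(training_values)
    sigma = statistics.stdev(training_values)
    shift = abs(statistics.mean(live_values) - mu) / sigma
    return shift <= max_shift

# Hypothetical fraud-scoring model: maps a transaction amount to a risk score.
model = lambda amount: min(1.0, amount / 10_000)

smoke_test(model, sample_inputs=[50.0, 9_000.0], expected_range=(0.0, 1.0))
print(drift_check([100, 120, 110, 95, 105], [104, 98, 112, 107, 101]))   # stable live data
print(drift_check([100, 120, 110, 95, 105], [400, 380, 420, 390, 410]))  # drifted live data
```

Real platforms layer far more sophisticated statistics on top, but the shape is the same: automated gates before deployment, automated alarms after it.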

How companies like DataRobot have driven the need for MLOps

DataRobot’s enterprise AI platform helps customers streamline the full ML life cycle across data preparation, model building and model deployment. H2O.ai offers a similar solution to DataRobot called H2O Driverless AI, which provides end-to-end automated AI capabilities. One of the key differences between the two platforms lies with their target users, as H2O.ai tends to cater to more technical users, whereas DataRobot serves business and IT folks along with data scientists.

Beyond end-to-end AI workflow platforms, the auto-ML market has been flooded with many companies providing tools for various parts of the enterprise AI stack. Cloud providers, including Amazon, Microsoft and Google, have innovated by developing auto-ML capabilities for cloud customers. Specialized platforms such as Domino Data Lab offer solutions for advanced users, and many tools such as TensorFlow and pre-built classifiers are readily accessible to developers for model building.

In the case of end-to-end AI workflow platforms such as DataRobot, some of the key benefits for enterprises have included the automation of various parts of the workflow, particularly around feature engineering and model generation, and the efficiency that comes with consolidating the entire workflow onto a single platform.

That’s perhaps a lot of buzzwords, so let’s consider the case of a security team at a credit card company assessing fraud risk for users. Let’s assume the input data consists of rows pertaining to end customers, with each row containing metadata including the day the customer’s card was activated, the day it expired and the number of fraudulent events identified in that time frame.

In order to effectively model the fraud risk, the security team would need to take the difference between the card activation and card expiration days and tie that to the number of fraudulent events identified. This is called feature engineering, which involves combining the input features in a way that helps an ML model learn the underlying patterns as effectively as possible.
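A minimal sketch of that feature engineering step, assuming illustrative column names rather than any vendor’s actual schema:

```python
from datetime import date

def engineer_features(rows):
    """Derive a 'days_active' feature from raw card metadata and pair it
    with the fraud count the model will learn from. Column names here
    are invented for illustration."""
    features = []
    for row in rows:
        days_active = (row["expired"] - row["activated"]).days
        features.append({"days_active": days_active,
                         "fraud_events": row["fraud_events"]})
    return features

rows = [
    {"activated": date(2018, 1, 1), "expired": date(2020, 1, 1), "fraud_events": 2},
    {"activated": date(2019, 6, 1), "expired": date(2019, 12, 1), "fraud_events": 0},
]
print(engineer_features(rows))
# [{'days_active': 730, 'fraud_events': 2}, {'days_active': 183, 'fraud_events': 0}]
```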

This may look simple, but problems often have a large number of input data columns that can greatly increase the number of combinations one has to try — and the relations between different data points may not be easy to discern, either.

Automated feature engineering makes this process simpler by auto-testing many different combinations of input features, quickly and at scale, to help the user pick the best one.

Once a user has finalized the set of features, DataRobot’s automated model generation capability lets them run many different types of models on the data, and see which ones perform best. This saves users the time of building models from scratch, and also gives them the benefit of seeing how different models perform.
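In miniature, automated model generation amounts to scoring many candidate models on held-out data and ranking them. The sketch below uses two invented toy candidates and a hypothetical holdout set; it is not DataRobot’s algorithm, just the leaderboard idea:

```python
def evaluate(models, holdout):
    """Score each candidate model on held-out examples and rank by
    accuracy, mimicking (in miniature) an auto-ML leaderboard."""
    results = {}
    for name, model in models.items():
        correct = sum(model(x) == y for x, y in holdout)
        results[name] = correct / len(holdout)
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

# Toy candidates: each maps days_active to a fraud flag. Thresholds are invented.
models = {
    "always_safe": lambda days: 0,
    "short_lived_risky": lambda days: 1 if days < 200 else 0,
}

# Hypothetical holdout set: (days_active, was_fraudulent) pairs.
holdout = [(730, 0), (183, 1), (90, 1), (500, 0), (150, 0)]
leaderboard = evaluate(models, holdout)
print(leaderboard)  # best-performing candidate first
```

Rerunning `evaluate` when new data arrives is exactly the re-ranking scenario described below: a candidate that won on one region’s data may lose once the holdout changes.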

Moreover, in situations where the data is rapidly changing, it gives users the ability to rerun the full set of models and re-determine which ones work best based on new data. In the case of the security team at the credit card company, consider a model that was developed in a particular region. If the security team is tasked with understanding fraud risk in another region and further receives some new data columns specific to that region, it’s possible the initial models won’t perform as well as new models that take all the available data into account.

The consolidation of the entire workflow into a single platform also provides several benefits for users. On the model building side, the coupling of data to a variety of models can make experimenting easier and help debug any issues that come up with the models much quicker. On the model deployment side, it helps with tracking source data and model attributes for models in deployment, both for any changes that become necessary and for governance.

Though companies like DataRobot and H2O.ai offer end-to-end AI workflow platforms, the drive toward automating these workflows has not been confined to single-vendor solutions. Given the modularity between data prep, feature engineering and model development, enterprises often combine a number of different solutions to satisfy their requirements.

In DataRobot’s case, using its products alongside Snowflake and Tableau has been a popular request from customers. Customers also commonly use ML tools offered by cloud providers in conjunction with DataRobot’s and H2O.ai’s products, and both companies provide tight integration with the major cloud providers.

The rapidly expanding MLOps solutions market

The market for MLOps solutions has been growing over the past year as enterprises focused their efforts on model deployment and governance following the widespread adoption of auto-ML tools.

DataRobot recently acquired ParallelM, one of the early entrants in the MLOps space back in 2017. ParallelM’s product enables customers to deploy models to infrastructure such as Kubernetes and Spark, either on-premises or on one of the major cloud providers, and H2O.ai partnered with ParallelM last year to offer its MLOps solution as well.

The MLOps space is also seeing open-source solutions crop up. Kubeflow is an open-source tool that enables MLOps capabilities for deploying to Kubernetes, and, similar to TensorFlow, it began as a project based on Google’s internal ML pipelines. Databricks has released an open-source tool called MLflow, which provides full life cycle workflows for ML development, including MLOps with deployment capabilities to Apache Spark.

The major cloud providers have also made their own forays into this category. Amazon SageMaker has introduced MLOps capabilities by helping customers leverage AWS Lambda and Step Functions for deploying models. Microsoft Azure has enabled tight integration between its auto-ML platform Azure Machine Learning and its Azure DevOps platform to enable MLOps functionality. Google Cloud has similarly moved to providing MLOps capabilities by outlining the use of TensorFlow and Kubeflow along with Cloud Build.

Enterprises deciding on which MLOps solution to use will likely consider the following two factors: the auto-ML platform they’re using, and the orchestration framework to which they plan to deploy. For enterprises using a cloud auto-ML platform such as Amazon SageMaker, the default choice will likely be to use the associated integrations from the cloud provider and string together an MLOps workflow. The same will likely be true for standalone platforms such as DataRobot, which provide auto-ML tools with an associated MLOps capability.

Kubernetes has increasingly become a popular, scalable orchestration platform for ML workloads. MLOps solutions such as Kubeflow, which helps deploy to Kubernetes, and ParallelM’s MCenter product, which also supports Kubernetes, are likely to see growing adoption given the platform’s widespread use. Another advantage of Kubernetes is its ability to streamline the hybrid deployments across on-premises and cloud infrastructure that many companies demand; OpenAI, for example, runs Kubernetes both on-premises and on Microsoft Azure.

The MLOps market will not likely be winner-take-all. We’ll likely see continued effort on the part of auto-ML providers to create tight integrations that enable MLOps capabilities for their customers, and select deployment practices, such as the use of Kubernetes, will continue to grow as developers begin to prioritize deployment possibilities from the outset when considering different ML workflow platform providers.