Automation in MLOps
In my last article, Where does MLOps fit in the Digital Transformation journey?, I described MLOps as a synthesis of DevOps practices and ML-specific processes for managing models end to end. That definition still stands, but this post focuses on automating workflows, so let’s reframe it:
MLOps is a set of complex, cross-functional enterprise workflows that rely heavily on an organization’s data assets.
As with other critical business processes, technology is used to abstract complexity away from decision-makers and to empower engineering and operations personnel to build scalable solutions.
It’s 2022 and there is still no standard MLOps technology stack. Enterprises continue to piece together combinations of open-source tools and paid services to support their present use cases. The industry’s ambitions are twofold: build a flexible AI platform with a consistent user experience that tightly integrates the tools MLOps professionals use today, and automate everything. In this blog post, I will identify the current areas of automation in MLOps.
Before I cover the current automation focus in MLOps, let us zoom out to understand the problem space in which ML projects reside. The industry has universally accepted practices for developing software and managing it in production: the software development lifecycle, the agile development process, CI/CD, and DevOps. DevSecOps leaders continuously preach reliability and observability through automation to deliver secure, working software fast. The impact of these industry standards and processes cannot be overstated when it comes to traditional software. With that in mind, why can’t ML projects fit neatly into enterprises’ existing development workflows?
Across the ML lifecycle, there are three main artifacts to manage: data, models, and code. Team topologies have formed around these three artifacts, and specialized roles have been created, such as Data Scientist, Machine Learning Engineer, and Data Engineer. Tools are being developed to assist these personas in performing their most critical tasks, while DevOps principles are being applied to models and data (both first-class citizens of machine learning projects). I would like to highlight automation efforts across four phases of the ML lifecycle: data preparation, model development, deployment, and monitoring.
A note on definitions: ML projects are created to solve business problems and output models, either as standalone solutions or embedded into larger software applications. Platforms are digital spaces that enable users to create and manage ML artifacts in a repeatable and scalable way.
Current Automation in MLOps
Data Preparation
Data is the essential input to any machine learning project, and it must first be collected and curated based on the identified business problem.
With data preparation, the end goal is to produce an ML artifact (for example, a quality labelled dataset) that closely resembles the real world.
There are plenty of tools and services which help data science teams ingest, transform, manage, and serve data to their models for training and inference purposes. Below are some examples of automation in the data preparation space:
- Auto Data Profiling: catch data issues as early as possible using Great Expectations (from Superconductive); a minimal profiling sketch follows this list
- Auto Data Ingestion: build automated, reliable, and secure data pipelines using Fivetran
- Auto Synthetic Data Generation: generate ‘fake’ data when collecting more ‘real’ data is not feasible, using Hazy, Tonic.AI, or Datomize
- Auto Data Labelling: programmatically label data using Snorkel.AI
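To make the profiling idea concrete, here is a minimal sketch using Great Expectations’ pandas-based API; the column names and thresholds are illustrative assumptions, and the exact calls vary across library versions:

```python
import pandas as pd
import great_expectations as ge

# Wrap a plain DataFrame so expectation methods become available
raw = pd.DataFrame({"customer_id": [1, 2, 3], "age": [34, 51, 29]})
df = ge.from_pandas(raw)

# Encode assumptions about the data as declarative expectations
df.expect_column_values_to_be_not_null("customer_id")
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)

# Validate the whole suite; fail the pipeline early if quality degrades
result = df.validate()
print("Data quality OK" if result.success else "Data quality check failed")
```

Running checks like these at ingestion time is what lets teams catch bad data before it ever reaches a training job.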
Model Development
Data science teams need a workspace to conduct rigorous data analysis and test hypotheses (also known as experimentation). This phase of the machine learning project lifecycle is extremely iterative, collaborative, and compute-intensive as data scientists train and tune the best-performing model. Applying automation techniques to assist the data scientist with optimization tasks frees up time and attention for higher-level work. Below are some examples of automation in the model development space:
- AutoML: perform feature engineering, model experimentation, hyperparameter tuning, and model selection given constraints and criteria set by the user. AutoML is a common offering in the market and is improving dramatically with the low-code/no-code ML movement; a minimal sketch of the kind of search it automates follows below.
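To illustrate the optimization work AutoML takes off a data scientist’s plate, here is a minimal hyperparameter search with scikit-learn; the dataset, model, and grid are illustrative stand-ins for what an AutoML service sweeps automatically:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy dataset standing in for a curated training set
X, y = load_iris(return_X_y=True)

# Search space: the kind of grid an AutoML service explores on its own,
# alongside feature engineering and model selection
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}

search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```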
Deployment
Only machine learning projects with models in production can claim victory. MLOps platforms seek to make deploying models to higher environments as easy as possible. Machine learning engineers are tasked with validating that the best-performing model from development also works in staging and production environments. The complex processes of packaging and deploying models to one or more target environments, as well as assessing the impact on infrastructure and model performance, make deployment the most challenging phase of the ML solution lifecycle. Below are some examples of automation in the deployment space:
- Auto-Deployment: publish models to specified environments in one click or after passing a test (for example, deploy the better-performing model in a champion/challenger test, or a new version of an existing model after retraining) using DataRobot; a sketch of such a promotion gate follows below
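The logic behind such a gate is simple even when the surrounding platform is not. Below is a minimal sketch of a champion/challenger promotion check in plain Python; the metric name, numbers, and uplift threshold are illustrative assumptions, not any vendor’s API:

```python
def should_promote(champion: dict, challenger: dict,
                   metric: str = "auc", min_uplift: float = 0.01) -> bool:
    # Gate: promote only if the challenger beats the champion by a margin
    return challenger[metric] >= champion[metric] + min_uplift

champion_metrics = {"auc": 0.87}      # current production model (illustrative)
challenger_metrics = {"auc": 0.91}    # newly retrained candidate (illustrative)

if should_promote(champion_metrics, challenger_metrics):
    print("Promote challenger to production")   # trigger deployment here
else:
    print("Keep champion; uplift threshold not met")
```

In practice, the minimum uplift guards against promoting a model whose apparent improvement is within noise.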
Monitoring
As soon as models go live and start serving predictions, their performance begins to decay. Real-world environments are in constant flux, and each model carries a different level of risk, which adds to the complexity. Monitoring models is crucial: based on key model performance metrics, machine learning engineers must decide whether to alert business stakeholders, retrain or replace models in production, or revert to a manual or rule-based system. Monitoring also serves as the first step in giving the business observability into the model development lifecycle. Below are some examples of automation in the monitoring space:
- Auto Monitoring: track a model’s desired performance metrics immediately after deployment using Fiddler, Arize, or Verta.AI
- Auto Alerting: streamline the alerts that reach an operations team regarding model issues (for example, massive data drift, concept drift, or availability and performance failures); a simple drift check is sketched below
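As one example of what an automated drift alert might compute, here is a minimal sketch using a two-sample Kolmogorov-Smirnov test from SciPy; the synthetic data and alert threshold are illustrative assumptions:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(0.0, 1.0, 10_000)  # reference distribution
live_feature = rng.normal(0.4, 1.0, 1_000)       # recent production traffic

# KS test: a common signal that a feature's distribution has drifted
statistic, p_value = ks_2samp(training_feature, live_feature)

if p_value < 0.01:
    print(f"Drift alert: KS={statistic:.3f}, p={p_value:.4f}")
else:
    print("No significant drift detected")
```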
With automation happening at every stage of the MLOps lifecycle, I see more opportunities for data teams to build trust in ML for their enterprises, customers, and communities.