By Murilo Cunha

Before we dive in, what is MLOps? There is much to say about it but, much like me, I'll keep it short and sweet.

As you could've guessed, MLOps is closely related to DevOps. The main idea is to reduce the friction between the operations and development of ML systems. As ML engineers, we are responsible for building ML systems (as opposed to ML models only). That includes data extraction, cleaning, feature engineering, model development, training, hyperparameter tuning, model evaluation, deployment (of the model and all the processing steps as well) and monitoring (data drift, triggering automatic retraining and redeployment, you name it). That distinction (systems, not just models) is important when we think about what we can do to speed up the development phase.

Much like DevOps, adhering to MLOps is not a binary thing. Different organizations may implement different levels of MLOps. But the more models you have in production or development, the more relevant these topics become. One way to assess the level of MLOps in an organization is to ask two questions: how long would a new model take to go from development to production? And how hard is it to maintain that model? So, with that in mind, how could we speed things up?

  • Experiment tracking: the early stages of development involve a lot of experimentation, and being able to easily see what was done and the results definitely impacts the overall speed of development.
  • Reproducibility: being able to recreate results allows developers to quickly investigate and explain model behavior (includes keeping track of models, hyperparameters, data, etc.).
  • Pipelines: especially important if we have tons of processing steps related to data cleaning and feature engineering. Due to the experimental nature of development, these pipelines may need to be rerun many times (see the pipeline sketch after this list).
  • Easy training and deployment: especially if we use cloud resources, it may take some time to create adequate training environments and manage the right dependencies and resources.
  • Testing: testing can help us catch errors early and avoid a snowball of problems later. This is true for testing code (unit tests) but also for testing models (see the testing sketch after this list).
  • Monitoring: model monitoring also takes quite a bit of effort. Can we detect when the data goes stale? Can we set triggers to send notifications, or even retrain the model on fresh data? (A minimal drift check is sketched after this list.)
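
To make the pipelines point a bit more concrete, here's a minimal sketch using scikit-learn's Pipeline and ColumnTransformer. The column names and the model are placeholders I made up, but the idea is that cleaning, feature engineering and the model become a single object that can be refit, versioned and redeployed as a unit.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names - adapt to your own dataset
numeric_cols = ["age", "income"]
categorical_cols = ["country"]

preprocessing = ColumnTransformer([
    ("numeric", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Cleaning + feature engineering + model as one object:
# rerun it during experimentation, version it, deploy it as a unit
model = Pipeline([
    ("preprocessing", preprocessing),
    ("classifier", LogisticRegression(max_iter=1000)),
])
# model.fit(X_train, y_train); model.predict(X_new)
```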
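And for the testing point, a rough sketch of what that could look like with pytest. The clip_outliers function and the tiny fixture model are hypothetical stand-ins for whatever your project actually has; the point is that we can test the code (unit tests) as well as the model's behaviour.

```python
import numpy as np
import pytest
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


# Hypothetical preprocessing step from the project
def clip_outliers(values: np.ndarray, lower: float = 0.0, upper: float = 100.0) -> np.ndarray:
    return np.clip(values, lower, upper)


def test_clip_outliers_stays_within_bounds():
    # Plain unit test on the processing code
    out = clip_outliers(np.array([-5.0, 50.0, 250.0]))
    assert out.min() >= 0.0 and out.max() <= 100.0


@pytest.fixture
def trained_model():
    # Tiny stand-in; in practice this would load your real model artifact
    X, y = load_iris(return_X_y=True)
    return LogisticRegression(max_iter=1000).fit(X, y), X


def test_model_outputs_valid_probabilities(trained_model):
    # Behavioural test on the model itself, not just the code
    model, X = trained_model
    probas = model.predict_proba(X)
    assert np.all((probas >= 0.0) & (probas <= 1.0))
    assert np.allclose(probas.sum(axis=1), 1.0)
```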
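Finally, for monitoring, a minimal drift check. This one just compares a live feature against the training distribution with a two-sample Kolmogorov-Smirnov test from scipy; real setups use more elaborate checks, but the shape is the same.

```python
import numpy as np
from scipy.stats import ks_2samp


def drift_detected(reference: np.ndarray, live: np.ndarray, alpha: float = 0.05) -> bool:
    """Flag drift when the live feature distribution differs significantly
    from the training (reference) distribution."""
    _, p_value = ks_2samp(reference, live)
    return p_value < alpha


# Illustrative data: the live feature has shifted, so the data went stale
rng = np.random.default_rng(42)
train_feature = rng.normal(0.0, 1.0, 1_000)
live_feature = rng.normal(0.5, 1.0, 1_000)

if drift_detected(train_feature, live_feature):
    # In a real system this is where you'd send a notification
    # or trigger retraining on fresh data
    print("Data drift detected - consider retraining")
```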

Evidently, I'm glossing over some details. But you get the idea. Not only does this speed up the whole development process, but it also ensures a more robust system at the end of the day. Luckily, the big cloud providers have been working hard to do a lot of the MLOps heavy lifting for us, making them a great starting point.

AzureML and AWS Sagemaker have services for setting up pipelines, experiment tracking, VMs for model training, model deployment and monitoring. GCP also offers the AI Platform, which can be used for model deployment, pipelines and experiment tracking (with managed Kubeflow pipelines). And since they all offer SDKs, it's easy to call them from a CI/CD script to orchestrate deployments or training jobs.
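
As an illustration of what that looks like in practice, here's a rough sketch of launching a training job with the SageMaker Python SDK (v2-style API). The entry point, IAM role, bucket and framework version are placeholders you'd swap for your own.

```python
from sagemaker.sklearn.estimator import SKLearn

# Placeholders: swap in your own script, IAM role, bucket and versions
estimator = SKLearn(
    entry_point="train.py",            # your training script
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    instance_type="ml.m5.large",
    instance_count=1,
    framework_version="1.2-1",         # scikit-learn container version
)

# Launches a managed training job; the same call can live in a CI/CD script
estimator.fit({"train": "s3://my-bucket/path/to/train/"})
```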

On the other hand, open-source projects have also stepped up on the experiment tracking, data versioning and pipelines side of things. To name a few, there are DVC, Pachyderm, MLFlow and Guild.ai.
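
As a small taste of the experiment tracking side, here's a minimal MLFlow sketch. Every run logs its hyperparameters, metrics and model artifact, so results can be compared and reproduced later; the dataset and model are just for illustration.

```python
import mlflow
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run():
    alpha = 0.5
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))

    # Everything needed to compare runs and reproduce results later
    mlflow.log_param("alpha", alpha)
    mlflow.log_metric("mse", mse)
    mlflow.sklearn.log_model(model, "model")
```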

As I mentioned, there is a lot more to it. I've added some links should you want to read up on it.

Yours truly, Murilo