Data scientists and machine learning engineers face an uphill battle when managing their projects. Not only do they have to get their code right, they also have to manage infrastructure and track their work systematically so that a model can be reproduced at a later stage.
Models usually require retraining, periodic tuning or complete remodelling to maintain accuracy and reliability. Concept drift, where the statistical relationship between a model's inputs and its targets changes over time, is a major area of concern.
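Drift monitoring can start very simply. The sketch below, which assumes numeric features and an illustrative two-standard-deviation threshold (not a prescribed one), flags a feature whose live mean has wandered far from its training-time mean:

```python
import statistics

def mean_shift_detected(train_values, live_values, threshold=2.0):
    """Flag possible drift when the live mean moves more than
    `threshold` training standard deviations from the training mean."""
    train_mean = statistics.mean(train_values)
    train_std = statistics.stdev(train_values)
    shift = abs(statistics.mean(live_values) - train_mean)
    return shift > threshold * train_std

print(mean_shift_detected([10, 11, 9, 10, 12], [10, 11, 10]))  # False: stable
print(mean_shift_detected([10, 11, 9, 10, 12], [20, 21, 19]))  # True: shifted
```

Real drift detectors compare full distributions rather than means, but a check like this is often enough to decide when a retraining trigger should fire.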
Even before we get to CI and CD for an ML pipeline, which deal primarily with testing the pipeline and pushing it to a production environment, you need a Continuous Training (CT) system to help you create these models. As the post Continuous Delivery for Machine Learning by ThoughtWorks points out, reproducible model training is one of the major issues that needs to be addressed while building ML apps. CT systems manage exactly this, helping developers set up infrastructure and track their work so they can spend more time building and refining the model.
We need systems that help us build and validate these models faster. Such a system must let the engineer train a model on demand, or automatically based on certain triggers. This is primarily handled by your Continuous Training (CT) pipeline. You usually start building your models on an interactive platform such as a Jupyter notebook. Once you are closer to a production-ready model and pipeline, you will want to turn it into a Continuous Integration (CI) pipeline that tests it and pushes it to production deployment.
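The trigger logic itself can be a small predicate. This is a minimal sketch, with hypothetical thresholds (10,000 new rows, 90% accuracy floor) chosen purely for illustration, of the kind of decision a CT pipeline evaluates:

```python
def should_retrain(new_rows, live_accuracy, *,
                   row_threshold=10_000, accuracy_floor=0.90):
    """Trigger retraining when enough new data has accumulated
    or live accuracy drops below an acceptable floor.
    (Both thresholds are illustrative, not prescriptive.)"""
    return new_rows >= row_threshold or live_accuracy < accuracy_floor

print(should_retrain(12_000, 0.95))  # True: enough new data arrived
print(should_retrain(500, 0.85))     # True: live accuracy degraded
print(should_retrain(500, 0.95))     # False: nothing to act on
```

In practice this predicate would be evaluated on a schedule or fired by a data-arrival event, with the CT system handling the actual training job.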
As pointed out earlier, reproducibility is a big worry in machine learning projects, but it can be handled. While building models, keeping a historical record of every ingredient of an experiment's run lets us recreate any experiment with almost no human intervention. Good CT procedures and practices can eliminate the risk of irreproducibility. Done right, they can help you close the loop on CI/CD for MLOps by automatically running a new training and testing loop every time updated data arrives in the system. Automating experiment tracking solves most of these problems: if we track all the important aspects of every trial run, it becomes easy to recreate the experiment and get back to your final model.
Anatomy of an ML system
Generally, ML training can be divided into five components: data, algorithms (code), containers, results and compute (hardware). Tracking each of these components helps make the model easily reproducible and lets you take it to production automatically.
Tracking datasets is as important as tracking your code in an ML project. If your training dataset is in a file-based format, you should use a tool to track and manage multiple versions of your data. Data Version Control, or DVC for short, is a tool built on top of Git that helps you track datasets.
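DVC works on a content-hashing principle: each tracked file is identified by a hash stored in a small `.dvc` file that Git versions. The stdlib sketch below illustrates that core idea, fingerprinting a dataset so a run can record exactly which version of the data it trained on (the `train.csv` filename is hypothetical):

```python
import hashlib
import json
import pathlib

def fingerprint_dataset(path):
    """Return a SHA-256 digest of a dataset file, similar in spirit
    to the content hashes DVC stores in .dvc files."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical dataset for the demo.
data = pathlib.Path("train.csv")
data.write_text("x,y\n1,2\n")

# Record the fingerprint alongside the run so you can later verify
# it trained on exactly this version of the data.
print(json.dumps({"dataset": data.name,
                  "sha256": fingerprint_dataset(data)}))
```

With DVC itself you would run `dvc add` on the dataset and commit the resulting `.dvc` file to Git, rather than hashing by hand.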
As you keep experimenting with different datasets, pipelines and models, your code keeps changing and evolving. Typically, this code is written in Python, either in .py format for file/job-based execution or .ipynb format for interactive Jupyter-based execution. Tracking your code against each training run lets you recreate the run exactly, provided the other variables listed below do not change. It also helps you move pipelines to production and automatically create a model prediction service. Git is the most common code-tracking tool used today; you can use GitHub, GitLab or other hosted services to get started.
Often neglected, the environment used for training is one of the toughest things to reproduce. It is very hard to keep track of all the system and Python packages that were installed, and the environment variables that were set, to run your code. Docker helps you track changes to your environment by pinning them in an image.
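Even without Docker, you can snapshot the software half of the environment from inside Python itself. This stdlib-only sketch captures the interpreter version and installed package versions, which is roughly what a Dockerfile's `pip install` lines pin down:

```python
import importlib.metadata
import platform

def environment_snapshot():
    """Capture the Python version and installed package versions:
    the software-environment half of what a Docker image freezes."""
    packages = sorted(
        f"{d.metadata['Name']}=={d.version}"
        for d in importlib.metadata.distributions()
        if d.metadata["Name"]  # skip distributions with broken metadata
    )
    return {"python": platform.python_version(), "packages": packages}

snap = environment_snapshot()
print(snap["python"], len(snap["packages"]), "packages")
```

Saving this snapshot with each run lets you rebuild a matching image, or at least diff environments when a rerun behaves differently.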
Logging each experiment in a structured manner lets you compare models through methodical statistical analysis. Metrics, hyper-parameters and model artefacts are a few of the things you must store for each experiment. MLflow is an open-source platform that comes with a fairly powerful tracking module; runs can be recorded locally or on a remote server.
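To make the idea concrete without depending on any particular tool, here is a stdlib sketch of the kind of per-run record a tracking module like MLflow's keeps, split into params, metrics and artifacts (the `runs` directory and example values are hypothetical):

```python
import json
import pathlib
import time
import uuid

def log_run(params, metrics, artifacts, run_dir="runs"):
    """Append one experiment run as a JSON file, mimicking the
    params/metrics/artifacts split used by tracking tools like MLflow."""
    run = {
        "run_id": uuid.uuid4().hex,
        "timestamp": time.time(),
        "params": params,        # hyper-parameters, e.g. learning rate
        "metrics": metrics,      # results, e.g. validation accuracy
        "artifacts": artifacts,  # paths to model files, plots, etc.
    }
    out = pathlib.Path(run_dir)
    out.mkdir(exist_ok=True)
    (out / f"{run['run_id']}.json").write_text(json.dumps(run, indent=2))
    return run["run_id"]

run_id = log_run({"lr": 0.01, "epochs": 10},
                 {"val_accuracy": 0.93},
                 ["model.pkl"])
print("logged run", run_id)
```

With MLflow proper, the equivalent calls would record the same three kinds of data to a local directory or a remote tracking server.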
Keeping track of the server specifications for your jobs helps you avoid issues that arise from changes in hardware. A different CPU, GPU or amount of memory on the machine can lead to out-of-memory errors, longer training times or a slightly different trained model. If you are using the public cloud, it is easy to recreate an experiment on the same specifications the earlier models were trained on.
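Basic machine specs can be captured from the stdlib and stored with each run; this sketch covers CPU and OS details (GPU details would need a vendor tool such as `nvidia-smi`, which is outside the stdlib):

```python
import os
import platform

def hardware_snapshot():
    """Record basic machine specs alongside a training run so a
    later run can be scheduled on matching hardware."""
    return {
        "machine": platform.machine(),      # e.g. x86_64, arm64
        "processor": platform.processor(),  # may be empty on some OSes
        "cpu_count": os.cpu_count(),
        "os": platform.platform(),
    }

spec = hardware_snapshot()
print(spec)
```

On a public cloud you would typically store the instance type instead, since it fully determines these specs.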
Segmind is an MLOps platform that can help your startup or enterprise build and maintain CT for all your AI model development. To learn how the Segmind platform increases build speeds by reducing the complexities of Continuous Training (CT), check out our website or contact us.