Identifying the gaps in Deep Learning
Working at my previous drone startup brought to light a number of complex challenges that some of the biggest infrastructure and energy conglomerates were facing. These were challenges that AI could solve: detecting faults in solar panels from thermal images, identifying stages of construction, and more. Thousands of dollars and man-hours could be saved from a few GB of drone data using ML-based computer vision techniques such as classification and semantic segmentation.
As fate would have it, my to-be co-founder at Segmind, Harish, was working on similar challenges at his prior startup. He was collaborating with institutes like Yale University and Tata Memorial to improve cancer detection by training ML models on pathology imaging data.
Previously, when we tried to tackle problems of this scale, we ran into unexpected roadblocks.
No, it wasn't labeling the data. We had a team of annotators to get that done fairly quickly. The real challenge was figuring out which of the hundreds of algorithms out there would work best for us.
Once that was done, we had to update these models constantly as new datasets arrived. Since the field was changing rapidly, we also had to keep trying newer techniques in parallel and replace the existing pipeline with a new one every few months.
Getting this right meant months and thousands of dollars saved. Getting it wrong meant a young, underfunded startup losing an enterprise client.
We realized that this was a problem faced by several other large corporations and smaller startups using machine learning to disrupt a wide range of industries. ML teams at these companies needed fast, scalable, and easily portable pipelines to improve their efficiency by orders of magnitude.
And so we started Segmind with the sole goal of making every ML team 10x more productive than they are today.
How do you accomplish a goal like that? We started at the most fundamental level and worked our way up.
Let’s take a deep dive into some of the issues we faced with our ML workflows:
Setting up scalable infrastructure
Initially, we bought a bunch of powerful GPUs and built our own servers. While this was the more cost-effective option, we hit maximum capacity in no time. We then started researching cloud computing, which was scalable and seemed like a better option than on-prem servers. But setting up infrastructure on the cloud was incredibly complex (something our DL team hated), and DL training on the cloud can get very expensive, very fast, unless it is carefully managed and monitored. So we realized we needed a platform that was easy to set up and scale.
Setting up environments for easier experimentation
A quick start with an environment is crucial for rapid iteration and experimentation with new algorithms. We began with AWS’s deep learning AMIs, which gave us a speedy start on AWS but not much choice: old operating system versions, hardcoded framework and CUDA versions, and a bloated image with every framework and library pre-installed did not do the job for us. We needed a simple and flexible way to create an environment with a specific version of a framework. Docker images did provide flexible options, but added the overhead of pulling, building, and managing images and containers.
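For context, the kind of flexibility we wanted looks something like a small, version-pinned Docker image rather than an everything-installed AMI. A minimal sketch (the base image tag and framework version here are illustrative, not what we actually ran; check the registries for current tags):

```dockerfile
# Start from a slim CUDA runtime base instead of a bloated all-in-one image
FROM nvidia/cuda:11.1.1-cudnn8-runtime-ubuntu20.04

# Install only the framework version this project actually needs
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3-pip && \
    rm -rf /var/lib/apt/lists/*
RUN pip3 install torch==1.8.1

WORKDIR /workspace
CMD ["python3", "train.py"]
```

Pinning the framework and CUDA versions per project keeps environments reproducible, but someone still has to build, store, and update these images, which is exactly the overhead we wanted to remove.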
Scaling up experimentation and orchestration
Deep learning approaches involve running multiple experiments on the same dataset with different pre-processing steps and hyper-parameters to learn more about the model. You often need to scale up to parallel training sessions on GPU machines. An ideal solution for distributed and parallel training is to build your pipelines on an orchestration framework like Kubernetes. However, setting up an entire machine learning infrastructure and its pipelines on Kubernetes is a highly complex undertaking that involves months of building and testing.
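The fan-out pattern described above can be sketched with Python's standard library alone: enumerate a hyperparameter grid and run one experiment per combination in parallel. The `train_model` function and its parameters are illustrative stand-ins, not Segmind's actual pipeline; an orchestrator like Kubernetes does the same thing at cluster scale, scheduling each run on its own GPU node.

```python
import itertools
from concurrent.futures import ProcessPoolExecutor

def train_model(params):
    """Stand-in for a real training run; returns a mock validation score.

    In practice this would launch a training job on a GPU machine and
    return the metric you are optimizing for.
    """
    lr, batch_size = params["lr"], params["batch_size"]
    score = 1.0 / (1.0 + abs(lr - 0.01)) + batch_size / 1000.0  # fake metric
    return params, score

# The hyperparameter grid: every combination is one experiment.
grid = [
    {"lr": lr, "batch_size": bs}
    for lr, bs in itertools.product([0.1, 0.01, 0.001], [32, 64])
]

if __name__ == "__main__":
    # Fan the grid out across worker processes and keep the best result.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(train_model, grid))
    best_params, best_score = max(results, key=lambda r: r[1])
    print(best_params)
```

The hard part is not this loop; it is provisioning, scheduling, and tearing down the GPU machines each worker runs on, which is where orchestration frameworks earn their complexity.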
CI/CD for our pipelines
Once we found a model that worked for us, we needed a way to update the models as soon as new datasets arrived. We also needed to keep trying out new methodologies to help us stay up to speed with the latest developments in the field.
As a startup, we were concerned about cloud costs. For the first few months, they were burning a hole in our pocket: we were spending close to $10,000 a month. Although the cloud gives you the flexibility to request as many machines as you want, on demand, you have to make sure you are not wasting resources. Optimizing resource utilization cut our burn by almost half, since automating infrastructure inherently reduces costs by managing the underlying resources efficiently. In addition, using spot machines for non-critical work reduced our costs further.
What Segmind has built so far
We set out to build the Segmind platform to automate two major parts of ML development: infrastructure and workflows. Our first goal was to enable users to set up infrastructure as quickly as possible, without having to deal with multiple elements such as Kubernetes, container images, storage, and security. In addition, we added an easy cost management dashboard along with tools that help manage resources better. Taking these into consideration, we built our first enterprise version, which teams could use to set up their own Kubernetes-enabled clusters within half an hour instead of days or weeks. We made the cluster creation process smooth and flexible by following the strategy below while building out the platform.
- Focus on DLOps: Existing data warehouses and data lakes do not fit unstructured datasets like images, videos, and audio.
- Simple and Frictionless UX: Reduce DevOps clutter with a clean, powerful UX, enabling developers to iterate faster and facilitate growth in the deep learning community.
- Personas: We kept data scientists/ML engineers and ML team heads/leads/managers in mind while designing our platform. We wanted Segmind to be the main workflow tool for all personas across the ML team.
- Monitoring the open and evolving market: We want to give our users access to new and evolving MLOps tools and libraries. So, we plan to tightly and seamlessly integrate such tools into our platform to keep our users up-to-date with the latest tools.
- Distributed data science, on cloud: Deliver speed and scalability with distributed training in the cloud.
Understanding and improving our product offering
As we have started working with more individual customers and teams to understand their requirements and pain points, we have improved our features and offerings to benefit the end user.
Some of the key features that we have built are:
- Segmind datastore - A managed datastore that enables users to work seamlessly with their data and share it among teams, while optimizing for cost and storage efficiency.
- Instance sharing - Team members wanted a way to work collaboratively, so we introduced an Instance Sharing feature that lets you share your entire instance with a collaborator.
- Recipes - This is a new feature we’re very excited about. With our pre-built Recipes, users can work directly on state-of-the-art algorithms in computer vision, NLP, recommendation systems, and more with a single click. No setup required.
As we grow, we are watching for larger trends to identify big opportunities for Segmind, and in this rapidly evolving market, our strategy is to double down on what's working. While we already have the infrastructure lifecycle managed on Segmind, we want to deepen our workflow automation toolset, transitioning from infrastructure automation to workflow automation.
We also plan to support multiple cloud providers over the next two quarters, making Segmind the true multi-cloud platform we set out to build.
We strive to continue to work with our users and the wider community to simplify cloud for machine learning. We are fuelled by how fundamentally ML is changing the world around us. There is so much more yet to come.