Partner Role In Project
SA Technologies was tasked with helping Tortoise Ai scale its monolithic machine learning application. Tortoise was forecasting a sharp increase in usage driven by its rapid growth, and it needed to scale its ML workloads to meet the demands of these new customers.
The challenge at Tortoise was threefold:
- A monolithic machine learning application that combined many CPU- and memory-intensive tasks (including ETL, pre-processing, post-processing, and more ETL)
- CI/CD was nonexistent, and code was updated manually on scores of instances
- Scaling up and down was labor-intensive: teams manually added EC2 instances and configured them to handle load
The team at Tortoise had forecast a theoretical limit on the number of customers the existing setup could support. That limit was going to be reached within 2 months, so we had to move very quickly to find a solution.
The Solution (summarize how you worked with GCP to provide a solution for your customer)
We started by breaking the monolith apart into microservices that split the machine learning work into distinct steps, each with a similar workload profile. For example, image processing and text processing were separated. A final service ran the processed results through various models and generated the final output.
We wanted the infrastructure to be scalable and abstracted away from the machine learning teams, and we wanted a mechanism for CI/CD and autoscaling. We chose to deploy each service on GCP Cloud Run and used GCP Cloud Build for CI/CD. GCP's off-the-shelf tooling was the simplest way to achieve scalability without implementing a full Kubernetes solution.
Each service depended on the result of another service, so we used a queue mechanism that persists the output of one service before handing it to the next. In our case, we created multiple GCP Cloud Tasks queues and built a very minimal framework that wraps each service, performs the necessary work, and passes the result on to the next queue.
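The queue-chained wrapper described above can be illustrated with a minimal sketch. This is not the framework that was delivered: every name here (InMemoryQueue, Stage, drain, the three example stages) is hypothetical, and an in-memory queue stands in for Cloud Tasks purely so the example runs locally. In a deployed version, the enqueue step would presumably create a task on the next Cloud Tasks queue instead.

```python
"""Sketch of a queue-chained stage wrapper (all names are hypothetical)."""
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List, Optional

Payload = Dict[str, object]


class InMemoryQueue:
    """Local stand-in for a managed queue; used only so the sketch runs."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._items: Deque[Payload] = deque()

    def enqueue(self, payload: Payload) -> None:
        self._items.append(payload)

    def dequeue(self) -> Optional[Payload]:
        return self._items.popleft() if self._items else None


@dataclass
class Stage:
    """Wraps one service: run its handler, then hand the result to the next queue."""

    name: str
    handler: Callable[[Payload], Payload]
    next_queue: Optional[InMemoryQueue] = None  # None marks the final stage

    def process(self, payload: Payload) -> Payload:
        result = self.handler(payload)
        if self.next_queue is not None:
            # The result is stored on the queue before the next stage sees it.
            self.next_queue.enqueue(result)
        return result


def drain(queue: InMemoryQueue, stage: Stage) -> List[Payload]:
    """Process every queued payload with the given stage and return the results."""
    results: List[Payload] = []
    while (item := queue.dequeue()) is not None:
        results.append(stage.process(item))
    return results


# Hypothetical three-step chain: image step -> text step -> model step.
text_queue = InMemoryQueue("text")
model_queue = InMemoryQueue("model")

image_stage = Stage("image", lambda p: {**p, "image_done": True}, text_queue)
text_stage = Stage("text", lambda p: {**p, "text_done": True}, model_queue)
model_stage = Stage("model", lambda p: {**p, "score": 0.5})  # placeholder output

image_stage.process({"job_id": "job-1"})
drain(text_queue, text_stage)
final_results = drain(model_queue, model_stage)
```

Keeping the handler and the hand-off separate means a service author only writes the handler, while the wrapper decides where the result goes next. That is one plausible way to keep queue details away from the machine learning code.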
We then tuned the framework in its deployed environment to improve throughput, running multiple processes within each container so that each container could handle more messages. In addition, we provided a synthetic load-testing framework so that Tortoise could tune its container limits as its customer base grew.
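As an illustration of the kind of synthetic load test mentioned above, the sketch below fires a fixed number of concurrent calls at a handler and reports throughput and latency percentiles. It is an assumption-laden example, not the delivered tool: run_load_test, timed_call, and fake_service are hypothetical names, and fake_service is a local stand-in for a call to a deployed service.

```python
"""Sketch of a synthetic load test (all names are hypothetical)."""
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles
from typing import Callable, Dict


def timed_call(handler: Callable[[int], object], request_id: int) -> float:
    """Invoke the handler once and return the elapsed time in seconds."""
    start = time.perf_counter()
    handler(request_id)
    return time.perf_counter() - start


def run_load_test(
    handler: Callable[[int], object], total_requests: int, concurrency: int
) -> Dict[str, float]:
    """Send total_requests calls using `concurrency` worker threads."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(
            pool.map(lambda i: timed_call(handler, i), range(total_requests))
        )
    elapsed = time.perf_counter() - started
    cuts = quantiles(latencies, n=100)  # 99 cut points: cuts[49] is the median
    return {
        "requests": float(total_requests),
        "throughput_rps": total_requests / elapsed,
        "p50_s": cuts[49],
        "p95_s": cuts[94],
    }


def fake_service(request_id: int) -> int:
    """Stand-in for a real request; sleeps briefly to simulate work."""
    time.sleep(0.001)
    return request_id


report = run_load_test(fake_service, total_requests=50, concurrency=5)
```

Rerunning a test like this at increasing concurrency levels, and watching where the p95 latency starts to climb, is one way to pick per-container limits.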
The solution allowed Tortoise to comfortably scale its ML work. Tortoise began the project serving only 30 customers with a full-time DevOps engineer and grew to serving nearly 2,000 customers, all within 7 months. By separating the image, text, and ML model processing logic, Tortoise was able to scale each workload independently and more efficiently. CI/CD with autoscaling meant that its DevOps engineers could focus on other mission-critical work.
We had an existential infrastructure scenario with very little time to support a very large increase in customers over a very short period of time. With Opalforce’s help we are no longer bottlenecked by our infrastructure and expect to easily support the next 3-5 years of growth on our platform.