Authors: Vinay Kakade, Shiraz Zaman
In a previous blog post, we discussed the architecture of Feature Service, which manages Machine Learning (ML) feature storage and access at Lyft. In this post, we’ll discuss the architecture of LyftLearn, a system built on Kubernetes, which manages ML model training as well as batch predictions.
ML forms the backbone of the Lyft app and is used in diverse scenarios such as dispatch, pricing, fraud detection, and support, among many others. While each team uses different modeling techniques, a common platform is needed to simplify the development of these models, parallelize model training, track past training runs, visualize their performance, run the models on a schedule for retraining, and deploy the trained models for serving. We built LyftLearn to achieve these goals. To satisfy the diverse use cases of ML at Lyft, we adhered to the following design principles.
Fast iterations: What differentiates ML model development from the rest of software development is the need for fast iteration. An ML practitioner needs to quickly evaluate different approaches to a problem and zoom in on the most promising one. A system that does not support fast iteration slows down ML modeling.
No restriction on modeling libraries and versions: The field of ML is fast-evolving, with innovation happening in multiple directions and new capabilities frequently being added to established modeling libraries. Teams at Lyft use diverse modeling libraries such as sklearn, LightGBM, XGBoost, PyTorch, and TensorFlow, among others — and different teams use different versions of them. LyftLearn should not restrict the modeling libraries or versions that can be used, barring rare exceptions for security reasons.
Layered-cake approach: We have customers who need programmatic access via an API, customers who prefer configuring and creating training jobs via a CLI, and customers who prefer interacting with a GUI. We follow a layered-cake approach where all these modes of access are available, with CLI and GUI generally forming a layer on top of the functionality provided by the API.
Cost visibility: ML training usually forms a big part of the infrastructure cost. Users of the system should know exactly what a training run costs so that they can decide whether the cost is justified by the expected business impact.
Ease of use: Given the wide adoption of ML across Lyft, we can’t afford to have users needing to go through complex onboarding. The system should be self-serve even for implementing the advanced aspects of ML such as distributed training and hyperparameter tuning. Additionally, it should be integrated with (a) existing data sources to get training data, (b) existing model serving solutions to deploy trained models.
We describe the architecture across its components: Model Development; Training and Batch Prediction; User Dashboard; Image Build; and the underlying Data & Compute infrastructure.
Model Development
Users can develop an ML model in a hosted Jupyter notebook environment, a hosted RStudio environment, or locally in their favorite editor. If using a hosted environment, the user goes to the LyftLearn homepage to select a hardware configuration (such as the number of GPU or CPU cores and the amount of memory) and a base image to start with. The system provides a wide selection of base images for common modeling techniques used at Lyft, and teams can create their own custom images as well.
The user can then develop within the notebook and install any additional dependencies, and can also connect this remote environment to a Git repository so that changes are tracked over time. Once satisfied with the model code, the user can Save Model, which saves a new container consisting of the model code and the additional dependencies overlaid on the base image. The user also specifies a version while saving (this could be the SHA of the corresponding Git commit) to track changes to the model code over time.
Note that once the user selects a hardware configuration and base image, the notebook environment is created on the underlying LyftLearn Kubernetes cluster, and this operation takes only a few seconds. The fast spin-up of a new environment increases the speed of iteration critical to the model development process. To save cost, notebooks that aren't used for a few hours are auto-saved and auto-terminated; since spin-up is fast, this doesn't degrade the user experience.
The user can also choose to develop the model locally in their favorite editor. In that case, the user can use a CLI to specify the model code and dependencies and then save the model.
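Conceptually, saving a model packages three things — the base image, the model code, and a user-supplied version — into a single container reference. The sketch below is purely illustrative: the `save_model` function, the manifest fields, and the image-naming scheme are hypothetical, not the actual LyftLearn CLI or API.

```python
# Hypothetical sketch of what "Save Model" records. The function name,
# manifest fields, and image naming are illustrative, not the real
# LyftLearn CLI/API.
def save_model(model_code_path, base_image, version, extra_dependencies=None):
    """Build a manifest describing the container to save: model code and
    extra dependencies overlaid on the base image, tagged with a version."""
    return {
        "base_image": base_image,                  # image development started from
        "code": model_code_path,                   # model code overlaid on the base image
        "dependencies": extra_dependencies or [],  # packages installed during development
        "version": version,                        # e.g. the SHA of the Git commit
        "image_tag": f"{base_image.split(':')[0]}-model:{version}",
    }

manifest = save_model("models/eta/", "lyftlearn-base:py39", "a1b2c3d",
                      extra_dependencies=["lightgbm==3.3.2"])
```

Tying the version to a Git commit SHA means every saved container can be traced back to the exact code that produced it.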
Training and Batch Prediction Jobs
Once a model container is saved, the user can run training jobs using it. The jobs can be configured and scheduled programmatically via the LyftLearn API, or manually via the CLI or GUI.
The model container takes hyperparameters and configuration parameters — and training jobs can be run in parallel for different sets of these. For example, a common scenario is to train (using the same model container) a different model for each of the granular geographies Lyft operates in. This can be achieved by passing the geographic region as a configuration parameter to the model container, having the model code query region-specific training data, and then training a model for that region. The training jobs run as Kubernetes jobs on the underlying cluster and can be scheduled to run periodically to retrain the model at a regular frequency. LyftLearn supports this parallelization via Flyte, Spark, or Fugue.
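The per-region fan-out can be sketched in plain Python. In production each configuration runs as its own Kubernetes job (via Flyte, Spark, or Fugue), but the shape of the parallelization is the same. The training function here is a toy mean predictor, and the region names and data are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for real training: "fit" a model (here, just the mean label)
# on the training data for one region. In LyftLearn each call would run as
# a separate Kubernetes job; the regions and data are illustrative.
TRAINING_DATA = {
    "sfo": [4.0, 6.0, 8.0],
    "nyc": [10.0, 12.0],
    "sea": [3.0, 5.0, 7.0, 9.0],
}

def train_for_region(region):
    labels = TRAINING_DATA[region]     # stand-in for a region-filtered warehouse query
    model = sum(labels) / len(labels)  # "training" = fitting a mean predictor
    return region, model

# Fan out one training run per region configuration, in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    models = dict(pool.map(train_for_region, TRAINING_DATA))
```

The key property is that the same container runs unchanged for every region; only the configuration parameter differs between jobs.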
After training, there are two ways a model could be deployed for predictions: (1) the model could be deployed as a service and called by another online service for point predictions, such as a pricing model that is called for every ride request in the Lyft app, or (2) the model could be scheduled for periodic batch predictions on large batches of data, such as an incentives model that is scheduled weekly and determines incentives for passengers for that week. We have built a separate system for models that need point predictions, and we use LyftLearn itself for parallelizing and scheduling models that need batch predictions.
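As a toy illustration of the batch-prediction path: a scheduled run applies a trained model to a large input split into chunks, where each chunk could run as its own job. The model, chunking helper, and data below are all hypothetical; this only shows the shape of the computation, not LyftLearn's implementation.

```python
# Illustrative sketch of a scheduled batch-prediction run: apply a trained
# model to a large input in fixed-size chunks. In LyftLearn the chunks
# would be parallelized across Kubernetes jobs; all names here are made up.
def predict(model, features):
    # Toy model: prediction proportional to a single feature.
    return [round(model * x, 2) for x in features]

def batch_predict(model, rows, chunk_size):
    """Split rows into chunks; each chunk could be its own job."""
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    results = []
    for chunk in chunks:  # run sequentially here; in parallel in practice
        results.extend(predict(model, chunk))
    return results

weekly_incentives = batch_predict(0.1, [10, 20, 30, 40, 50], chunk_size=2)
```

A weekly schedule then just re-runs `batch_predict` over that week's data with the latest trained model.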
User Dashboard
Users can see all models, along with their versions and their past training and batch-prediction runs, in the GUI. For each run, the user can access the corresponding logs and model performance metrics. Once a model is developed, users can opt to deploy it to the production serving layer and manage the complete lifecycle through the GUI.
Image Build
In the Model Development section, we discussed how a user selects a hardware configuration and a base image before starting development. LyftLearn provides a selection of base images with common modeling techniques, and users can add their own by invoking an Image Build, which is essentially a wrapper over Docker's image build. Teams typically create their own base image, or extend one of the existing base images with team-specific libraries, and use it for all the models the team develops.
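Extending a base image with team-specific libraries amounts to generating a small Dockerfile and handing it to `docker build`. The wrapper below is a hypothetical sketch of that step — the function, image names, and package pins are illustrative, not LyftLearn's actual Image Build interface.

```python
# Hypothetical sketch of the Image Build wrapper: extend a base image with
# team-specific libraries by generating a Dockerfile. The real system wraps
# `docker build`; the names and versions here are illustrative.
def render_dockerfile(base_image, team_packages):
    lines = [f"FROM {base_image}"]
    if team_packages:
        # Pin versions so every model the team builds uses the same stack.
        lines.append("RUN pip install " + " ".join(team_packages))
    return "\n".join(lines)

dockerfile = render_dockerfile(
    "lyftlearn-base:py39",
    ["lightgbm==3.3.2", "xgboost==1.6.1"],
)
```

The resulting image becomes the team's shared starting point, so every notebook and training job in the team begins from the same dependency set.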
Data and Compute
To get the above user-facing functionality to work, we rely on the following components:
LyftLearn Kubernetes Cluster: Kubernetes forms the backbone of the computing infrastructure of LyftLearn, and the notebooks, as well as training and batch-prediction jobs, run on Kubernetes. The cluster is optimized for interactive development and long-running jobs. The primary reasons for choosing Kubernetes are: (1) we can package the model code and its dependencies as containers, enabling teams to use different modeling techniques and their versions, (2) starting a new LyftLearn environment takes only a few seconds, enabling fast iterations.
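A training run on the cluster boils down to a standard Kubernetes `batch/v1` Job wrapping the saved model container. The sketch below builds such a manifest as the Python dict that would be serialized and submitted to the cluster; the job name, image tag, resources, and arguments are hypothetical.

```python
# Illustrative Kubernetes Job manifest for one containerized training run,
# as the dict that would be serialized to YAML/JSON and submitted to the
# cluster. Names, image tag, and resource sizes are hypothetical.
def training_job_manifest(name, image, cpu, memory, args):
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": "train",
                        "image": image,  # the saved model container
                        "args": args,    # e.g. region + hyperparameters
                        "resources": {"requests": {"cpu": cpu, "memory": memory}},
                    }],
                    "restartPolicy": "Never",
                }
            },
            "backoffLimit": 2,
        },
    }

job = training_job_manifest("eta-train-sfo", "lyftlearn-base-model:a1b2c3d",
                            "4", "16Gi", ["--region", "sfo"])
```

Because each run is just a Job with a different image and arguments, per-region fan-out and scheduled retraining reduce to submitting many such manifests.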
Intermediate Storage: We mount AWS Elastic File System as a Kubernetes volume for each user. Users typically use this to store intermediate data files.
Training Data, Model Metadata, Container Storage: The training data comes from Lyft’s data warehouse and is queried using Hive, Presto, or Spark. The metadata of models (such as ownership information, past runs, metrics) is stored in AWS RDS Aurora. The base images, as well as model containers, are stored in AWS Elastic Container Registry.
We reviewed the architecture of LyftLearn, a large-scale system for model development, training, and batch predictions. The primary design principles followed in LyftLearn’s architecture are support for fast iterations, no restriction on supported modeling libraries and their versions, and a layered-cake approach enabling the system to be accessed programmatically via API, CLI, or GUI. LyftLearn has wide adoption and is used by dozens of teams to build hundreds of models every week.
Huge thanks to Han Wang, Anindya Saha, Drew Hinderhofer, Eric Yaklin, Andy Rosales-Elias, Willie Williams, Saurabh Bajaj, Mayank Juneja, Narek Amirbekian, Jason Zhao, Gil Arditi, Erik Vandekieft, Craig Martell, and Balaji Raghavan for making this work possible.
Interested in working with us? Check out our current job openings!
LyftLearn: ML Model Training Infrastructure built on Kubernetes was originally published in Lyft Engineering on Medium.