At Mixpanel, we’ve had our ups and downs with the cloud. We moved off the cloud back in 2011 to dedicated hardware deployed on Softlayer. Although we got rid of some of the noisy neighbor problems we saw on Rackspace, it came at the cost of over-provisioning our processing capacity because it was tough to get machines on-demand. After a lot of soul-searching, we finally decided to move back to the cloud in 2016. At the time, Kubernetes (k8s) was emerging as the industry leader in container orchestration. We really liked Google’s managed k8s offering (GKE), and by late 2017, a large portion of Mixpanel’s infrastructure was running on Google Cloud Platform (GCP).
When we first moved to GKE, we set up two identical clusters. Both were in the same region but in different zones for redundancy and high availability. Each team would deploy services under its own namespace. This gave us a way to scale out our k8s deployments as more and more workloads moved to the cloud. We started with handcrafted YAML files for each of these workloads. Engineers would edit and apply these files ad-hoc. To keep both clusters in sync, they would have to deploy code twice. Occasionally, someone would forget to deploy to the second zone, which led to subtle differences that were hard to debug. Over time, we created a wrapper named mix that would diff the new files against what was running in production to give you an idea of what changes would be applied, and then alert you if one zone was out of sync with the other. Soon, mix became the de facto tool to build and deploy code manually. Say you wanted to deploy a new version of a service foo; you'd run the following commands:
```
mix build prod/foo
mix deploy <cluster>/<namespace>/foo --image <image>
```
Manual deploys worked surprisingly well while we were getting our services up and running. More and more features were added to mix to interact not just with k8s but also other GCP services. To avoid dealing with raw YAML files directly, we moved our k8s configuration management to Jsonnet. Jsonnet allowed us to add templates for commonly used paradigms and reuse them in different deployments.
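To give a flavor of what this looks like (a minimal sketch; the file names and template API here are hypothetical, not Mixpanel's actual library), a Jsonnet library can capture a commonly used paradigm once and let each service fill in only what differs:

```jsonnet
// lib/deployment.libsonnet (hypothetical) -- a reusable template for a
// common paradigm: a k8s Deployment with sane defaults.
{
  deployment(name, image, replicas=2):: {
    apiVersion: 'apps/v1',
    kind: 'Deployment',
    metadata: { name: name, labels: { app: name } },
    spec: {
      replicas: replicas,
      selector: { matchLabels: { app: name } },
      template: {
        metadata: { labels: { app: name } },
        spec: { containers: [{ name: name, image: image }] },
      },
    },
  },
}
```

A service's config then imports the template and overrides only what it needs, e.g. `(import 'lib/deployment.libsonnet').deployment('foo', 'gcr.io/example/foo:v1') + { spec+: { replicas: 5 } }`, instead of repeating the full Deployment boilerplate in raw YAML.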
At the same time, we kept adding more k8s clusters. We added more geographically distributed clusters to run the servers handling incoming data to decrease latency perceived by our ingestion API clients. Around the end of 2018, we started evaluating a European Data Residency product. That required us to deploy another full copy of all our services in two zones in the European Union.
We were now up to 12 separate clusters, and many of them ran the same code and had similar configurations.
While manual deploys worked fine when we ran code in just two zones, it quickly became infeasible to keep 12 separate clusters in sync manually. Across all our teams, we run more than 100 separate services and deployments. Many of these services depended on each other, so there were times when we’d have to carefully coordinate deploys to make sure a dependency was fully updated before updating the services that relied on it. While adding integration tests and a better alerting system helped improve things, we quickly realized that we needed to approach this problem from first principles. It was fairly clear to us that we needed an automated way to deploy code and a CI/CD setup. We decided to run a POC with two of the most popular recommendations: Spinnaker and Argo.
We were mainly looking for three things:
- How easy is it for developers to build pipelines to deploy a service?
- How easy is it to deploy multiple services at once?
- How comprehensive are the out-of-the-box capabilities of the solutions?
We found Spinnaker to be a lot more user-friendly. Seeing a visual representation of a pipeline as it was being built was excellent. Spinnaker’s built-in Slack messaging integrated nicely with our existing deployment monitoring procedure. Our engineers monitor Slack and will respond quickly when notified about failed deployments. The biggest pain point with Spinnaker was the inability to represent pipelines in code. We tried to work around this limitation by creating a custom library using Jsonnet, but it wasn’t the easiest thing to maintain and run. Lastly, Spinnaker uses Java. This isn’t bad per se; we do run a few Java services at Mixpanel. However, our engineering team is more familiar with Golang, and we prefer using Golang-based solutions wherever possible.
Argo is a lot less opinionated and is built on k8s. There was no GUI for building pipelines, but the docs were fairly good at explaining everything. Because everything used k8s YAML files under the hood, we could version-control all the changes easily using git. To make changes to a pipeline, all you needed to do was create a new branch with your changes, test it, and merge it into the main branch. We could also reuse many of our Jsonnet templates and tooling to create these workflows, which was a bonus.
In the end, it was a fairly close decision, but we decided to go with Argo.
As we mentioned earlier, one of the key features we were looking for was the ease with which any engineer at Mixpanel could define and update pipelines for deploying their services. Since Argo workflows are just YAML files, we could extend our Jsonnet libraries to help make this process smoother. To illustrate this, one of the most common workflows for an engineer at Mixpanel goes something like this:
- Create a branch for a new feature or bug fix.
- Make sure this branch passes all automated testing and is reviewed.
- Build a docker image based on this branch.
- Deploy the image and any accompanying configurations to production.
While we haven’t figured out a way to automate writing the features and bug fixes themselves, the rest is fairly trivial with the templates we’ve built using Jsonnet. Here’s an example of what building a pipeline looks like:
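The snippet here is a rough sketch rather than our exact template library: the helper and image names are hypothetical, but the output is a standard Argo Workflow with a build step followed by a deploy step.

```jsonnet
// pipeline.jsonnet (hypothetical names) -- compiles to an Argo Workflow
// that builds a service's image and then deploys it.
local workflow(service) = {
  apiVersion: 'argoproj.io/v1alpha1',
  kind: 'Workflow',
  metadata: { generateName: service + '-deploy-' },
  spec: {
    entrypoint: 'build-and-deploy',
    templates: [
      {
        name: 'build-and-deploy',
        // Each inner list is one sequential step; build runs before deploy.
        steps: [
          [{ name: 'build', template: 'build' }],
          [{ name: 'deploy', template: 'deploy' }],
        ],
      },
      {
        name: 'build',
        container: {
          image: 'example/builder',  // hypothetical builder image
          command: ['mix', 'build', 'prod/' + service],
        },
      },
      {
        name: 'deploy',
        container: {
          image: 'example/builder',
          command: ['mix', 'deploy', 'prod/' + service],
        },
      },
    ],
  },
};

workflow('foo')
```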
This gives you a working pipeline to deploy your service. Adding Slack notifications is easy, as well. We use a separate task and a sequential workflow to notify the relevant channels:
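Conceptually (again a hedged sketch; the secret and file names are hypothetical), the notification is just another templated task appended as the final sequential step, posting to a Slack incoming webhook:

```jsonnet
// lib/notify.libsonnet (hypothetical) -- a workflow step that posts a
// custom message to a Slack channel via an incoming webhook.
{
  slackNotify(channel, message):: {
    name: 'notify-slack',
    container: {
      image: 'curlimages/curl',
      env: [{
        name: 'SLACK_WEBHOOK_URL',
        // Webhook URL injected from a k8s Secret rather than checked in.
        valueFrom: { secretKeyRef: { name: 'slack-webhook', key: 'url' } },
      }],
      command: ['sh', '-c', |||
        curl -X POST -H 'Content-type: application/json' \
          --data '{"channel": "%s", "text": "%s"}' "$SLACK_WEBHOOK_URL"
      ||| % [channel, message]],
    },
  },
}
```

Because the message body is built in Jsonnet, each team can tailor the channel and text per pipeline instead of relying on a one-size-fits-all notification.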
It did require more work than Spinnaker’s built-in integration, but we are happy with the more customizable messages.
Now that engineers could define pipelines more easily, we wanted to focus on the end-user experience of the deployment process. Our deployment options were specified by the directory structure of the YAML configuration directory, so it wasn’t straightforward to determine either which services could be deployed or which clusters a particular service was running in. To solve this, we made a simple interactive tool (based on https://github.com/AlecAivazis/survey) that prompts the user with potential services and the clusters for each service. Instead of a deployer having to remember every cluster where a particular service runs, the service owner updates the config when the service is deployed to a new cluster, and the deployer is prompted to include the new instance the next time they deploy. An example of the “interactive” deploy tool can be seen below.
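The prompt options come from a checked-in config that owners keep up to date. Conceptually (the file name, service names, and cluster names here are purely illustrative), that source of truth might look like:

```jsonnet
// deploy_targets.jsonnet (hypothetical) -- the mapping the interactive
// tool reads: which clusters each service currently runs in.
{
  services: {
    api: ['us-central1-a', 'us-central1-b', 'europe-west1-b', 'europe-west1-c'],
    ingestion: ['us-central1-a', 'us-central1-b'],
  },
}
```

When a service owner adds a new cluster to this list, every subsequent deployer of that service is automatically prompted to include it, so knowledge of where a service runs lives in the repo rather than in anyone’s head.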
In addition, this more holistic deploy tool allows for dependency and configuration coordination. Actively preventing conflicting deployments has become more important as our engineering team has scaled and our service orchestration has gained complexity. We have been extremely pleased with Argo so far; it addresses all of the problems we set out to solve.
Over the last few months, a lot of our services have moved to Argo. The chart below shows the number of Argo-based deploys per month.
While we’ve been pretty happy with k8s and Argo, we realize that improving developer experience is a continuous process (no pun intended). Over the next few months, we’d really like to get to a state where we have a stable pipeline for continuous integration and deploys for our most critical services.
The DevInfra team at Mixpanel acts as a force multiplier, implementing best practices and sophisticated developer tooling across teams to build a world-class platform for our engineers. We identify high-impact projects through top-notch internal analytics and observability — at times, even using Mixpanel itself! If that sounds interesting to you, please reach out to [email protected].