Safer Deployments to Kubernetes using Canary Rollouts
Over the last few months, I’ve had the opportunity to work on some really exciting problems in the cloud-native space as an intern at grofers. One such problem was safely releasing software to Kubernetes using Canary deployments, and that’s what we will cover in this post. We’ll discuss the tools we used, the considerations we made while working with them, and our journey so far.
In this post, I might use ‘Canary deployments’ and ‘Canary rollouts’ interchangeably. Note that they mean the same thing in this context.
Rolling out new changes to production always brings the risk of unplanned downtimes even if the software is thoroughly tested. It is a standard practice to run integration, E2E, and acceptance tests as a part of a deployment pipeline before pushing a release to production. Because the new changes have never really been exposed to live production traffic, no amount of testing can ever guarantee complete stability in production environments, making release engineering all the more challenging.
The solution to this problem is to actually test changes in production — and that is where Canary deployments help. Canary deployment is a release strategy that allows you to evaluate a new version of the software (the Canary) by redirecting only a small percentage of live traffic to it. This evaluation, carried out by metrics analysis, is what helps decide whether to proceed with a rollout or not. The benefit we get from such an approach is that faulty releases are detected during the release process itself while having minimum impact on the users.
The Canary deployment process works by deploying the new version of the software (the Canary) alongside the stable version of the same software. In the context of Kubernetes, each version can be thought of as a Deployment (or a pod), so your application now comprises two parts — the Canary deployment and the stable deployment. A proxy sits in front of both versions and performs a traffic split between them (more about this later).
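To make this concrete, here is a rough sketch of what such a traffic split can look like on Kubernetes, using the SMI TrafficSplit API as one of several ways to express it. The service names and weights are illustrative, not a real setup:

```yaml
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: my-app
spec:
  # apex Service that clients talk to
  service: my-app
  backends:
  # the stable version keeps most of the traffic
  - service: my-app-stable
    weight: 95
  # the Canary receives a small slice
  - service: my-app-canary
    weight: 5
```

The proxy (service mesh or ingress) that implements this API then routes roughly 95% of requests to the stable backend and 5% to the Canary.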
Canary releasing at grofers
At grofers, we have been running mission-critical services on Kubernetes for about 2 years now. Our deployments handle and process thousands of requests and transactions every minute. It behooves the software engineering teams at grofers to ensure reliability and minimize unplanned downtimes.
Before we deploy our code to production, it is run through a series of checks and tested extensively on staging environments via CI/CD pipelines, mostly powered by Jenkins with a bit of Tekton (at the time of writing). Once the tests and CI are green, a Jenkins job automates the process of deploying the application on Kubernetes. We wanted to modify the last part of this release process to incorporate the Canary rollout strategy. Eventually, we’d like to automate that part as well, to reduce toil.
If you want to learn more about CI/CD at grofers, head over to this blog post by Ashish Dubey.
Considerations made while selecting a tool
Much like with any other tool, we wanted something that could integrate with our existing infrastructure without much hassle. While Argo Rollouts and Flagger are both great tools, we wanted something that fit our use cases and culture. Spoiler alert — we went ahead with Flagger, but below is a list of factors we considered while choosing to work with it:
Migrating existing resources
Argo Rollouts definitely seemed simpler to use at first, but we were not entirely sure about replacing our Deployment objects with Rollout — a CRD provided by Argo Rollouts that takes the place of the Deployment object. At grofers, we have a large number of micro-services managed by multiple development teams, and performing a migration of resources at such a scale would be a tedious task.
On the other hand, Flagger would seamlessly integrate with your existing Deployment and other resources related to it. Flagger has its own CRD called Canary where you’d specify the target Deployment and configure your release strategy. Much like Argo Rollouts, Flagger too would automate the entire release process when new changes were pushed.
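For illustration, a minimal Canary object might look something like the sketch below. The names, ports, and thresholds are placeholders, not our production configuration:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-service
  namespace: prod
spec:
  # the existing Deployment that Flagger should manage
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  service:
    port: 80
  analysis:
    interval: 1m      # time between analysis iterations
    threshold: 5      # failed checks before an automatic rollback
    maxWeight: 50     # maximum traffic percentage routed to the Canary
    stepWeight: 10    # traffic percentage increase per step
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99       # roll back if the success rate drops below 99%
      interval: 1m
```

When a new image is pushed to the target Deployment, Flagger picks up the change and drives the analysis loop defined here.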
Manual gating
At least initially, manual gating would form an important step in our Canary release process. While we have a lot of automation and alerting set up around our monitoring stack, we are still in the process of defining SLOs and metrics that are relevant to the Canary release process. The decision to promote a Canary to Stable will happen based on evaluating certain metrics that define what failures or bad releases look like. The question is — which metrics to monitor?
While it would have been easy to use standard metrics such as USE/RED and restrict the metric analysis to just the target deployment, our micro-services have complex dependencies on each other, and failures in one service can affect other services as well. Hence, we need to standardize our metric analysis process for Canary releasing; only once we have figured out what needs to be analyzed can we let the tool automate the analysis for us. To get there, we wanted some degree of manual control over the metric analysis part of the Canary release process, at least initially.
Argo Rollouts offers this feature right out of the box! The Rollout object lets you require manual approvals before the next step of the Canary analysis begins. Further, Argo Rollouts provides a kubectl plugin that lets you manually control the different stages of the Canary analysis. Flagger, on the other hand, unfortunately did not come with this level of flexibility in terms of manual gating. However, it was easier for us to find a work-around for this (more about this later) than to migrate hundreds of Deployment objects to Rollout.
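For comparison, this is roughly how a Rollout expresses a manual gate in its canary strategy. The name and weights here are illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10   # shift 10% of traffic to the Canary
      - pause: {}       # wait indefinitely for a manual promotion
      - setWeight: 50
      - pause:
          duration: 10m # timed pause before full promotion
```

A rollout held at the indefinite pause is then resumed with the kubectl plugin, e.g. `kubectl argo rollouts promote my-service`.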
Integration with Ingress controllers / API Gateways
The traffic split between the Canary and stable version forms another important step in the Canary deployment process. Neither Argo nor Flagger directly performs traffic splitting for you. Instead, they integrate with external Ingress controllers.
We have migrated a major portion of our infrastructure to Kubernetes and are still in the process of migrating the rest. One component that we’d like to see running on Kubernetes is the API gateway. At grofers, we’re using Kong as an API gateway that sits outside Kubernetes (on EC2 and AutoScaling Groups). While we could definitely hack our way around and make Kong work on Kubernetes, we would prefer a more native solution. Out of the many ingress solutions available for Kubernetes, we shortlisted Kong for Kubernetes and Gloo Edge because of their API gateway-like capabilities, which were an important requirement for us.
Fortunately, Flagger could directly integrate with Gloo Edge. Argo Rollouts, on the other hand, did not support Gloo Edge or any other ingress that we could re-use for our existing networking requirements. Using Argo Rollouts would mean we’d have to take care of yet another additional component just for the sake of having Canary deployments.
The issues we faced and how we’re trying to fix them
Introducing a Canary deployment strategy to our existing deployment pipelines hasn’t been very straightforward. Keeping the above factors in mind, we decided to go ahead with Flagger. Though it seems to have served us very well so far, there were some rough edges specific to our use-cases that we’re actively trying to work on.
Manual gating — kubectl plugin and Slack bot
As explained in the previous section, manual gating would form an important step in our Canary deployment process. We were particularly interested in controlling the step-weight increase of traffic to the Canary. Flagger could achieve manual gating using webhooks, but it did not come with a webhook configuration for this particular use case, so we contributed a fix to the upstream project to add this feature.
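With that change in place, a gate can be attached to the Canary analysis via a webhook of type confirm-traffic-increase. The sketch below assumes a hypothetical gatekeeper service behind the URL; Flagger holds the next step-weight increase until the endpoint returns HTTP 200:

```yaml
analysis:
  webhooks:
  # the traffic weight is only increased while this
  # endpoint responds with HTTP 200
  - name: traffic-increase-gate
    type: confirm-traffic-increase
    url: http://gatekeeper.flagger-system/gate/check
```

Opening and closing that gate is what our tooling (described next) exposes to developers.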
Now that we had a way to control the traffic percentage, we wanted a very simple way for our developers to interact with this system, keeping developer experience as our top priority. We initially started by building a kubectl plugin similar to the one provided by Argo Rollouts. This plugin would interact with the webhook to promote the Canary to the next step or roll it back to the previous version. This solved a very big problem for us — it abstracted the underlying workings of the system away so that developers would only have to focus on whether they want to promote or roll back the Canary.
But that was not the end of it. This solution came with its own set of drawbacks — the major one being cluster access. Each developer who wanted to control the Canary would have to be given additional access to the Kubernetes cluster, which was a risk from a security perspective. To solve this problem, we extended our approach to work as a Slack command bot instead of a kubectl plugin. The advantages this brought us were:
- Development teams would no longer be required to directly interact with the Kubernetes cluster, as was the case with the kubectl plugin. Instead, they could issue the same commands over Slack, and the Slack bot would execute these instructions against the Canary deployment.
- One less tool to worry about! We use Slack internally for our daily communication. Having the ability to directly control your Canary deployments from Slack itself seemed like a great user experience.
Alternatively, we had also considered creating a Jenkins pipeline for triggering manual promotions. However, as mentioned in the second point above, a Slack bot offers better usability for our developers — executing a command on Slack is a lot easier than triggering a pipeline from Jenkins’ clunky web dashboard.
Multi-phased progressive delivery
Canary deployments are just the beginning for us. In the future, we’d also like to have a way to perform A/B testing as a part of the same deployment pipeline. At the time of writing, Flagger does not provide a way to segment a Canary deployment into multiple phases, each configured with its own testing strategy. Though this isn’t an immediate requirement for us, we’re looking into how we could build this feature into Flagger.
Creating a safety net for testing out Canary deployments
Flagger has its own way of automating and managing Canary deployments which, when introduced into a live production environment, carries the risk of breaking incoming traffic (as we learned from experience). For example, Flagger scales down the original target deployment, scales up a new one with the -primary suffix, and patches the label selectors on your Service. The need for a safety net to test out Canary deployments thus came from the way Flagger modifies your existing deployment setup (ironic, but important).
To be clear, we ran many PoCs and experiments with Flagger before deciding to take it to production. But we wanted to be extra safe while making such architectural changes. We placed an instance of HAProxy in front of our application and configured it to perform a traffic split between our Canary setup and the original application, with a small percentage of traffic (5%) going to the Canary deployment setup. The figure below illustrates our safety net setup.
With this approach, even if anything went down with our Canary deployment setup, 95% of the traffic would still safely reach the original application. To put it very simply, we “Canaried” our Canary deployment =).
Note that this safety setup would be temporary — after monitoring the performance and behavior of the system, we’d eventually remove this safety net and let the Canary deployment handle all the traffic.
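The HAProxy split described above can be expressed with weighted servers in a single backend; the addresses below are illustrative:

```
frontend app_in
    bind *:80
    default_backend app_out

backend app_out
    balance roundrobin
    # 95% of requests go to the untouched original application
    server original     10.0.0.10:8080 weight 95
    # 5% go to the Flagger-managed Canary setup
    server canary-setup 10.0.0.20:8080 weight 5
```

With round-robin balancing, HAProxy distributes requests in proportion to the server weights, giving us the 95/5 split.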
Building a pipeline for progressive delivery has been exciting, yet full of challenges for us. There are not many resources or standards available for building progressive delivery pipelines in a Kubernetes-native way, so we were tasked with building custom tooling and solutions that could integrate with our existing tools. We’ve tried to open-source most of these tools (links to the projects are below). Argo Rollouts and Flagger are both amazing projects and a big step towards bringing progressive delivery to Kubernetes.
While both tools have their own sets of pros and cons, Flagger seemed better suited to our use cases. We did face some challenges around manual gating, but fortunately we were able to build solutions to these problems, and we hope they come in handy for other teams facing similar challenges. Currently, we are integrating this technology into a few of our live services to analyze the value it brings, before rolling it out across the infrastructure.
- Flagger Slack bot — https://github.com/grofers/flagger-slack-handler
- Flagger kubectl plugin — https://github.com/mayankshah1607/kubectl-flagger
Mayank Shah interned with DevOps Team at Grofers this summer. You can follow him on LinkedIn.
Canary rollouts — how we’re trying to safely deploy to Kubernetes was originally published in Lambda on Medium.