Evolving CD in a cloud-native environment

In this post, we’ll discuss the stages through which continuous delivery infrastructure has evolved at Grofers. We’ll start with how we began with Kubernetes and Jenkins and managed its growing adoption. We’ll then discuss why we decided to move from Jenkins to Tekton, how we plan to scale it beyond a few teams, and what kind of challenges we have faced and are currently facing.

Background

As a team, we started working on continuous delivery almost 3 years ago. This was when product delivery cycles were long and unreliable. Testing was mostly manual, against a designated staging environment to which we deployed using Ansible, Docker, and Packer. This staging setup helped us a lot as we tried process changes to improve our testing and deployments, but over time we realized that it was slow to work with and that a single environment became a bottleneck when a couple of engineering teams were building independent features that they wanted to test.

Migrating to Kubernetes

To improve the provisioning of staging environments, we started migrating services to Kubernetes. We built an in-house tool called “mft” that helped us create developer namespaces, with the required setup on Consul and Vault tied to a Kubernetes namespace. We used Jenkins to deploy all the services that needed to be tested together in a namespace, and ran our tests against them for all PRs and before deployment to production.

The migration to Kubernetes has itself been a big journey for us, with its own set of wins, challenges, and lessons. We’ve talked about it in detail in a different blog post if you want to dive deep into it.

Standardizing Jenkins Pipelines

This approach got us using Jenkins with Kubernetes heavily, and as teams and projects adopted Jenkins pipelines, we started noticing a lot of repetition in the pipeline specs across projects. While copy-pasting a Jenkinsfile made it easy for developers to get started, it was sometimes difficult for us to push out common changes, whether bug fixes or new features. This pushed us towards standardizing the implementations.

We started with Jenkins shared libs and wrote a common implementation for one group of engineering teams. This worked very well: we were able to consolidate and refactor all pipelines at once, which led to several optimizations; the CI implementation became easy to understand for all similar services; and common features and bug fixes were easy to push through a single implementation.

CI pipeline and the shared library for representation

There were benefits of doing this, but what was not desirable is that it took us a lot of effort to build these shared libs and despite our efforts to keep them simple, they ended up looking very complicated. Standard pipeline specs had departed from being declarative in nature and there was a lot of imperative Groovy logic mixed with Pipeline DSL.

Shifting towards Kubernetes-native CD

While the Jenkins shared libs approach worked for us, we wanted to be careful with pushing it beyond a few teams. This was because we weren’t sure whether we wanted to continue using Jenkins or some other Kubernetes native CD platform.

Why did we consider this possibility? Firstly, while we’d been running Jenkins for a while and had learned how to scale it for the kind of usage we saw, it was a bit hard for us to learn and sustain. Managing Jenkins infrastructure meant we had to learn how to manage its capacity (both the primary server and worker nodes), security, and change management. We could find answers to all of these, and to an extent we did, but since we as a team were already managing these things on Kubernetes, a Kubernetes-native CD platform would allow us to leverage that knowledge better. Operations aside, while copy-pasting was a way for developers to get started with Jenkins pipelines, any modification meant a learning curve for them. Our dev teams were already in the process of learning how to manage Kubernetes resources, so if they could manage their CD pipelines using the same fundamentals, it could potentially save them some effort learning how to do it on Jenkins.

Since Jenkins did not work natively with Kubernetes, nor did it have anything in its ecosystem that understood k8s natively, we had to write a lot of plumbing to make pipelines work with k8s the way we wanted.

We did have options like the Jenkins Kubernetes Plugin to bring Jenkins a bit closer to Kubernetes, but that would have helped with just one small problem: using Kubernetes Pods as execution agents. We’d still have to manage Jenkins as a platform, maintain our shared libraries as a mix of Groovy scripts and Pipeline DSL, and have our developers learn Jenkins.

With the above context, we started evaluating if there are new Kubernetes native CD platforms that could help us. This evaluation was necessary for us to make an informed decision on whether or not we need to move away from Jenkins and if so, then towards which tool. We looked at a bunch of alternatives, out of which Argo Workflows and Tekton were the top choices for our experiments given their feature set and community support. We had also looked at Jenkins X which seemed to do better on a lot of issues we had with Jenkins, but it seemed like an opinionated all-or-nothing solution that would require us to change a lot of things in our dev workflow to even get started with.

Through a couple of working PoCs, we concluded that while Argo Workflows and Tekton were quite similar when it comes to the execution engine, Tekton seemed a better fit for CI use-cases, for the following reasons:

  1. Argo Workflows has been battle-tested for running complex workflows, especially in the data engineering domain, and a lot of that is made possible by its advanced features and flexible structure. It’s possible to use these for building a CI pipeline as well. Tekton is less powerful in that way, but its different types of resources, like Pipeline and Task, make it a little easier to visualize a CI pipeline as such rather than as a generic workflow.
  2. Because Argo Workflows has been used heavily in the data engineering domain, most of the community support is limited to that. There are some examples in the documentation and lately more focus on CI/CD related use-cases but it’s not comparable to Tekton which is primarily built for these use-cases.
  3. At one point, we were ready to ignore some of the above because given some time and effort, we were able to make things work with Argo Workflows, but when we compared the immediate user experience of the dashboards of Argo Workflows and Tekton, Tekton seemed less of a deviation from the experience our developers had with Jenkins or CircleCI (at their own peril). This is again where typed resources seemed helpful, as the UI was organized in a way that one could browse through different Pipelines, Tasks, and their runs separately instead of having a flat view of all workflow templates or all workflows as in the case of Argo Workflows.

Standardizing CD with Tekton

Standardization with Tekton had two goals (as in the case of Jenkins):

  1. we wanted to make it easy for developers to build pipelines by avoiding platform-specific boilerplate
  2. from the ops perspective, we’d like to make it easy to control platform-specific code (ease of shipping new features and bugfixes)

Since pipelines were now Kubernetes resources, we had two ways of doing this:

  1. we build and maintain our own CRD abstracting away platform details and give the most relevant controls to developers
  2. we use standard tools like Helm or Kustomize to control the platform-specific parts

Kustomize FTW!

Our team had some experience in building CRDs and custom controllers, so we saw it as possible but it seemed like overkill at the point when we were still trying to figure out the underlying platform. This was the reason why we went with Kustomize for what it does well, that is code reuse.

A sample kustomize manifest for representation. A sample base lives here — https://github.com/grofers/tekton-ci-starter/blob/main/ci-base/

For the most common use-cases, we ended up creating base resources in a central repository maintained by our team, which could be referenced by different projects. This had the desirable effects we expected: the code required to get started with a standard pipeline was minimal, with only the parts most relevant to the project needing to be customized. A large part of a pipeline was platform-level detail and standard workflow that didn’t have to change for every project, and most of it could be centrally managed by our team. Kustomize also allowed developers to tweak larger parts of their pipeline if they needed to, but that’s where some learning curve is involved.

To override the bases, they’d need to read and understand the base implementation. Simple customizations were easy and documented as straightforward steps, but beyond that developers would need to get into the details of Tekton pipelines. However, this is sort of what we wanted. We didn’t want to take the flexibility or power of Tekton away from developers, but we did want to smooth the learning curve, and this approach made that possible. The hypothesis is that if we’re able to get our developers started with Tekton easily for common use-cases, the advanced users can learn enough to break free from our base implementations and build their own custom pipelines from scratch if needed. We were able to realize the full spectrum.
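To give a sense of how little a project has to declare in such a setup, here is a minimal sketch in Python that assembles the content of a project-level kustomization as a plain dictionary. It is an illustration, not our actual manifest: the project name `orders`, the use of `namePrefix`, and the remote URL form for the base are all assumptions.

```python
import json


def project_kustomization(project: str, base: str) -> dict:
    """Build the content of a project's kustomization file.

    `base` points at the centrally maintained resources; `namePrefix`
    is a standard Kustomize field that prefixes every resource name.
    """
    return {
        "apiVersion": "kustomize.config.k8s.io/v1beta1",
        "kind": "Kustomization",
        "resources": [base],
        "namePrefix": f"{project}-",
    }


# Hypothetical project name and an assumed remote URL form for the base.
overlay = project_kustomization(
    "orders", "https://github.com/grofers/tekton-ci-starter//ci-base?ref=main"
)

# Printed as JSON, which is also valid YAML.
print(json.dumps(overlay, indent=2))
```

Everything beyond these few fields would live in the shared base, which is the point of the approach: projects state what is specific to them and inherit the rest.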

Hurdles that we’ve come across

The above specification is how we decided to expose Tekton to developers at Grofers for standard use-cases. It seemed promising, but there were a lot of things that had to be dealt with behind the scenes. In this section, we’ll go through some of the problems we faced and how we approached them.

Do-It-Yourself CI/CD

Tekton by design is a very minimal, take-what-you-need system. This means that to get an end-to-end CI workflow operational, there are multiple components that you need to learn about and set up. For example, to set up a CI pipeline that triggers upon every change pushed to a GitHub pull request, one needs to set up an EventListener, TriggerBinding, TriggerTemplate, and of course a Pipeline.

To make the setup a bit generic, we decided on a convention for naming our Pipelines and used that convention in our TriggerTemplate. This allowed us to re-use a single instance of EventListener, TriggerBinding, and TriggerTemplate to run CI pipelines across several projects.
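The idea of a naming convention can be sketched in a few lines of Python. The sketch mimics what a shared TriggerTemplate does: given only the repository name from the incoming event, it works out which Pipeline to run. The `<repo>-ci` convention and the `revision` parameter name are assumptions for illustration, not necessarily the ones we use.

```python
def pipeline_name(repo: str) -> str:
    """Hypothetical convention: a repo's CI Pipeline is named '<repo>-ci'."""
    return f"{repo.lower()}-ci"


def pipeline_run_for(repo: str, revision: str) -> dict:
    """Build a PipelineRun manifest that references the Pipeline by convention."""
    name = pipeline_name(repo)
    return {
        "apiVersion": "tekton.dev/v1beta1",
        "kind": "PipelineRun",
        # generateName lets Kubernetes append a unique suffix per run.
        "metadata": {"generateName": f"{name}-run-"},
        "spec": {
            "pipelineRef": {"name": name},
            # 'revision' is an assumed parameter name.
            "params": [{"name": "revision", "value": revision}],
        },
    }


run = pipeline_run_for("Orders", "abc123")
print(run["spec"]["pipelineRef"]["name"])
```

Because the Pipeline name is derived rather than configured, adding a new project requires no change to the shared trigger resources, only a Pipeline that follows the convention.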

The high-level design of our standard CI pipeline

While these components are designed to bring a lot of flexibility, they can be a bit daunting when teams get started with Tekton, especially if they’re used to something which provides all of this out-of-the-box like Jenkins.

Resources that form a part of our setup explained above are available in this repo — https://github.com/grofers/tekton-ci-starter. They can be used to understand things better and hopefully make it easier to get started with a setup like ours.

Pipelines-as-code workflow

With Jenkins, we had gotten used to the practice of managing all of the pipelines as code in the project repo itself in Jenkinsfile. This practice paid off very well as it made the pipelines better managed. In-place editing was gone and all changes were made through pull requests. Moreover, pipelines were portable — there were occasions when we moved from one Jenkins instance to another or when pipelines were deleted from Jenkins by accident and it was very easy to create them from code. On top of this, it made testing pipeline changes very easy with multi-branch pipelines, and this is something that we were about to miss in Tekton.

While it was possible to manage the pipeline as code, testing pipeline changes made in a different branch is hard on Tekton. Technically, developers could do it by creating their pipeline resources under different names or in different namespaces, but we anticipated that this would be too much work given how seamless the experience was that everyone was used to on Jenkins.

There was no clear answer to this problem, but it was so important for us that we were ready to scrap the idea of moving ahead with Tekton if we couldn’t find a workaround. Thankfully, after a bunch of experiments, we were able to build this workflow with Tekton. It involved writing a custom web service that would sit in front of the Tekton EventListener and intercept events from GitHub. This service would then create a new namespace, create all Tekton resources present in the .ci folder of the repo, and run the pipeline (if it was created) for the incoming change.
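The first two steps of that flow can be sketched as a pure function that turns a GitHub pull request event into the commands an interceptor would run. This is not the implementation of our service: the namespace naming scheme is an assumption, and the sketch only plans the commands without executing them or starting the pipeline run.

```python
import re
from typing import Dict, List

# Pull request actions that indicate new code to test.
RELEVANT_ACTIONS = {"opened", "reopened", "synchronize"}


def namespace_for(repo: str, pr_number: int) -> str:
    """Derive a namespace name that is valid as a DNS label (assumed scheme)."""
    slug = re.sub(r"[^a-z0-9-]+", "-", repo.lower()).strip("-")
    return f"ci-{slug}-pr-{pr_number}"[:63].rstrip("-")


def plan(event: Dict) -> List[List[str]]:
    """Return the kubectl commands to prepare an isolated CI namespace."""
    if event.get("action") not in RELEVANT_ACTIONS:
        return []
    ns = namespace_for(event["repository"]["name"], event["number"])
    return [
        ["kubectl", "create", "namespace", ns],
        # Apply whatever pipeline resources the branch carries in .ci/.
        ["kubectl", "apply", "--namespace", ns, "--filename", ".ci/"],
    ]


event = {"action": "synchronize", "number": 42, "repository": {"name": "Orders_API"}}
for command in plan(event):
    print(" ".join(command))
```

Giving every change its own namespace is what makes branch-level testing of pipeline definitions safe: resources created from one branch cannot overwrite those of another.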

The high-level design of our pipeline-as-code setup. Tomahawk is the custom webhook listener.

This worked really well and apart from the hesitation of writing and managing our custom components, we didn’t really have any reason to not use it.

Getting used to the experience

Once a few teams had their pipelines on Tekton, we heard about a few use-cases that teams were quite used to. A typical CI pipeline of a project would have a step for deploying to a dev environment. While we covered this in the automated CI pipeline that runs for a PR and deploys to an ephemeral namespace, teams also needed a way to deploy any specific branch to a specific namespace, which was typically done on Jenkins via parameterized builds. Parameterized builds on Tekton were not straightforward, especially because our Pipelines depended on Workspaces, which were basically persistent volumes declared in the PipelineRun in the TriggerTemplate used by the GitHub EventListener.

We could ask developers to keep some PipelineRun specs handy, but this was again a big deviation from the convenience that Jenkins provided. At least for common use-cases, like deploying a service to a long-running staging environment, we wanted the simplest possible trigger. We already had the Tekton webhook EventListener through which we could trigger Pipelines; we just needed a user-facing interface. Since we use Slack heavily, we ended up building a simple slash command to deploy a specific branch to a specific namespace for any service that follows the standard workflow.
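A slash command handler mostly comes down to parsing the text a user typed into parameters for the pipeline. The sketch below, in Python, parses the form-encoded body that Slack posts for a slash command. The command name `/deploy` and the `<service> <branch> <namespace>` argument order are assumptions for illustration, not the syntax of our actual command.

```python
from urllib.parse import parse_qs

USAGE = "usage: /deploy <service> <branch> <namespace>"


def parse_deploy_command(body: str) -> dict:
    """Extract deploy parameters from a slash command request body.

    Slack sends the text typed after the command in the 'text' form field.
    """
    form = parse_qs(body)
    parts = form.get("text", [""])[0].split()
    if len(parts) != 3:
        raise ValueError(USAGE)
    service, branch, namespace = parts
    return {"service": service, "branch": branch, "namespace": namespace}


# Example body for: /deploy orders feature/cart staging-1
body = "command=%2Fdeploy&text=orders+feature%2Fcart+staging-1"
params = parse_deploy_command(body)
print(params)
```

The resulting dictionary is what a handler would forward to the webhook EventListener as the payload for the deployment Pipeline.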

Interacting with Tekton slash command

This was well received and perhaps seemingly even simpler than using parameterized builds on Jenkins.

Tekton’s dashboard is very slow

The way the dashboard works is by querying Kubernetes APIs. That means there is no proper pagination, which makes the situation very difficult when you run a large volume of pipelines. There are some discussions about this in the community with a promising solution being the work on the Tekton/results project. This is supposed to be a service that would store the results of the pipeline runs which would then be fetched by the dashboard instead of being fetched directly from Kubernetes APIs. The results backend would solve the problems by supporting more efficient querying. Here’s us waiting.

Some other rough edges

While the community around Tekton is evolving along with its feature set, it’s still a young project. You’ll find that several features, big and small, that exist in a mature CD project like Jenkins are missing. On Jenkins, for example, we’ve been using a GitHub OAuth plugin which provides very functional role-based access control based on GitHub teams. There was no way of doing this on Tekton the last time we checked.

Besides, Tekton is an actively developed project, which means things can change at a fast pace. One of the examples of this is the construct of Pipeline Resources. When we started, we structured our workflows around Git and Pull Request type of Pipeline Resources but later found out that Pipeline Resources won’t be supported beyond the alpha version due to several reasons. Frankly speaking, even we found Pipeline Resources to be a bit confusing and preferred using alternative Tasks to achieve similar functionality but we had to incorporate these changes in our Pipelines.

Conclusion

Our continuous delivery journey, from Jenkins to Tekton and from EC2 to Kubernetes, has been full of challenges. That is hardly surprising given how much has changed and how many factors are at play when scaling these practices for more than a dozen fast-moving product engineering teams. None of the changes that we’ve made have worked perfectly. There hasn’t been one solution that has satisfied 100% of the needs of all teams. However, the sum of our efforts has contributed to a lot of improvement in developer experience and speed. Standardization has helped teams build commonly used workflows without repeating code or spending effort on learning platform details, while giving platform teams the power to drive changes faster at scale.

When it comes to Tekton specifically, there are some challenges that we face currently, but fortunately, most of our big hurdles have workarounds. We sometimes face situations where there may not be a straightforward parallel to a Jenkins UX, but then those are situations when we’re pushed to rethink the use-case towards making it simpler. The Slack interface and release-based deployments are examples of that. A tweet from @kelseyhightower is worth quoting here –

There is no single continuous integration and delivery setup that will work for everyone. You are essentially trying to automate your company’s culture using bash scripts.

We could easily say that this is becoming our guiding principle when it comes to solving problems in continuous delivery and developer experience at Grofers.


Evolving CD in a cloud-native environment was originally published in Lambda on Medium.

Source: Grofers
