Summer 2020 Intern Spotlight: Improving the Code Deployment Pipeline with Krystopher

We had seven (remote) interns join us at Benchling this summer. We also welcomed back Karen and Rebecca from last year’s intern class as full-time engineers! In this post, we’ll dive into Krystopher’s summer internship. For more 2020 intern spotlights, check out the blog posts highlighting Lilly’s and Nikash’s experiences.

A little about myself

Photo of Krystopher

Name: Krystopher
School: UC San Diego
Fun fact: As part of my Senior Honors Thesis I wrote a Tetris solver for an NP-complete version of Tetris.
Team: Infrastructure

Why Benchling

Before interning at Benchling, I was finishing my undergraduate degree and tutoring a Discrete Mathematics course at UC San Diego. I was looking for an internship at a company devoted to improving the lives of others and supporting its employees. In Benchling, I found a company that seemed to have this purpose, and one that would let me move back home near my family as well. After speaking to some of the engineers, I knew this company would be a great fit for me. I was particularly interested in the Infrastructure team, which works on areas of software engineering that I was unfamiliar with before starting.

What I built this summer

Developers at Benchling are constantly writing new features and are eager to get those features out to customers quickly. To accomplish that, we use Continuous Integration (CI) and Continuous Deployment (CD) to release changes consistently and with little delay. For my intern project, I worked to optimize and rework Benchling’s CI/CD process. After the deployment improvements, the time it took to deploy code to our customers dropped from 45 minutes to 15 minutes on average. With the reworked pipeline, we were also able to deploy more often and make verifying changes much easier.

We deploy Benchling to our customers in parallel, and each deploy includes a step that publishes 3 Docker images. However, so that deploys could reuse the work of constructing the Docker images, we ran only one publish step at a time and stored the result in a private Elastic Container Registry. Because Benchling has over 100 customers and the publishing was serialized, the total time to deploy was more than (# customers) * (time taken to publish). In practice, this caused delays of over an hour while deploying to customers, which hurt most when a hotfix was going out. The problem would only get worse as Benchling’s customer base grows.

In an effort to reach a constant deploy time, I moved the publishing of Docker images into its own process and ran that process in parallel with the tests that were run when developers wanted to merge code, before releases even started. The result was publishing once rather than over 100 times, and publishing in parallel so it contributed nothing to deploy time.
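
To make the arithmetic concrete, here is a minimal Python sketch of the two deploy-time models described above. It is a simplified illustration, not Benchling's tooling: the function names and the sample numbers (0.5 minutes per publish, 10 minutes per rollout) are hypothetical and are not measurements from the project.

```python
def serialized_publish_deploy_minutes(num_customers, publish_minutes, rollout_minutes):
    """Before: publish steps run one at a time, so the last customer waits for
    every publish to finish and then still needs its own rollout."""
    return num_customers * publish_minutes + rollout_minutes


def prepublished_deploy_minutes(num_customers, publish_minutes, rollout_minutes):
    """After: images are published once during CI, before any deploy starts.

    num_customers and publish_minutes are accepted only to mirror the signature
    above; neither affects the result, which is the point of the change.
    """
    return rollout_minutes


# Hypothetical inputs, chosen only for illustration.
before = serialized_publish_deploy_minutes(100, 0.5, 10)
after = prepublished_deploy_minutes(100, 0.5, 10)
```

With these made-up inputs, the first model gives 60 minutes for 100 customers, while the second gives 10 minutes no matter how many customers there are.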

Before

Diagram of deploy pipeline before my project
This diagram shows the process of deploying to one customer: first logging for internal purposes, then constructing the Docker images used for the services, and finally deploying the services before notifying the customer.

To reduce the time that deploys took to run, I moved this publish step from the continuous deployment part of the process to the continuous integration part, running it in parallel with the tests we run before integrating. The result was not only a drop in deployment time but also a drop in time complexity: the publishing cost during a deploy went from growing linearly with the number of customers, O(n), to constant, O(1). Hotfix deploys saw a similar drop in both complexity and time.

After

Diagram of deploy pipeline after my project
This diagram shows the process of deploying to one customer after my project. Some of the work that was done in serial is now done in parallel, and the publishing occurs before we deploy to a customer, so it doesn’t contribute to deploy time.
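
The idea of running the publish step alongside the tests can be sketched with Python's standard library. This only illustrates the concurrency pattern and is not the Buildkite pipeline itself: `run_tests`, `publish_images`, `run_ci`, the image names, and the sleep durations are all hypothetical stand-ins.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_tests():
    """Hypothetical stand-in for the pre-merge test suite."""
    time.sleep(0.2)
    return "tests passed"


def publish_images():
    """Hypothetical stand-in for building and pushing the three Docker images."""
    time.sleep(0.2)
    return ["image-a", "image-b", "image-c"]


def run_ci():
    """Run tests and publishing concurrently; return both results and wall time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as pool:
        tests_future = pool.submit(run_tests)
        images_future = pool.submit(publish_images)
        test_result = tests_future.result()
        image_tags = images_future.result()
    elapsed_seconds = time.perf_counter() - start
    return test_result, image_tags, elapsed_seconds
```

Because the two tasks overlap, the wall time is close to the longer of the two rather than their sum, so by the time a deploy starts the images already exist.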

After working on the deploy process, I looked into improving the release process, which starts with releasing code to a test environment where developers manually review their changes to make sure they’re free of bugs. By rewriting the release process, I was able to double the number of deploys to test environments per day, make the verification process more stable, and let developers verify their changes more quickly, so customers see product improvements sooner.

While working on my project, I tracked many of the metrics that I was trying to improve, like deploy time. I saw a tangible improvement in these metrics and shared them with the Engineering team.

Chart of the drop in deploy time (minutes)

This project was particularly interesting to me because it involved a number of technologies that I wasn’t familiar with before starting. In addition to the various Terraform and AWS interactions I had, I became very familiar with the inner workings of Buildkite. Throughout the project I learned most of the features that Buildkite offers, as well as how to work around some of its constraints. For the rest of my internship, I will keep iterating on the CI/CD process while expanding my focus to other parts of the infrastructure.

Hackathon

Benchling hosts an annual hackathon, where people from all across the company (not just engineers!) spend a couple days building out and demoing a product from scratch.

During this year’s hackathon, I worked on speeding up our tests, which helped decrease deploy time and gave developers faster feedback on the correctness of their changes. As a result, test execution time dropped from 28 minutes to 12 minutes on average. Additionally, I worked on updating some of the Terraform for Buildkite resources during Benchling’s “Bug Bash” and will be working on updating the process for deploying hotfixes during Benchling’s “Infra Days”.

Working at Benchling

Starting at Benchling, I had little to no context on the technologies involved in running the infrastructure of a modern company. My classes at university had not come anywhere close to covering Buildkite, AWS, Terraform, etc. Because of this, my first few weeks were very challenging. I worried about every change I made because I was unfamiliar with the technology and my changes risked breaking important parts of the infrastructure. With guidance from my mentor Kayne, and by pairing with him, I soon became far more comfortable with the process of updating the infrastructure and testing my changes, and the work became much easier after that.

As I have continued to work at Benchling throughout the summer, I have come to appreciate the attitude of everyone on the engineering staff. In every communication and meeting, it is clear that this is a group that cares about its work and is eager to learn. As a result, people are comfortable admitting mistakes, fixing them, and learning from them, and just as comfortable messaging someone with a quick question or putting a meeting on their calendar to ask a few.

If you’re interested in interning or working full-time at Benchling, we’re hiring!

Thanks to Chanel, Kayne, and Somak for reading drafts of this.

Summer 2020 Intern Spotlight: Improving the Code Deployment Pipeline with Krystopher was originally published in Benchling Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source: Benchling