Summer 2020 Intern Spotlight: Parallelizing Bulk Back Translation with Nikash

We had seven (remote) interns join us at Benchling this summer. We also welcomed back Karen and Rebecca from last year’s intern class as full-time engineers! In this post, we’ll dive into Nikash’s summer internship. For more 2020 intern spotlights, check out the blog posts highlighting Krystopher’s and Lilly’s experiences.

A little about myself

Photo of Nikash

Name: Nikash
School: University of Virginia
Fun fact: I love Cuban food! Guava cheese pastries, cafe con leche, Cuban sandwiches, fried plantains…you name it.
Team: Molecular Biology

Why Benchling

I had interned in FinTech and autonomous vehicles over the past few summers, and I enjoyed working on the unique problems being solved in those industries. That said, the companies I had worked at were relatively large, and my work there felt only incremental. I wanted to intern at a startup where my work would make a difference. I discovered Benchling on a list of top YC alumni and started reading more about the work they do. After I interviewed, my future manager Vineet walked me through possible intern projects, and I loved the ownership and flexibility I’d have with them. I’d be able to learn a ton, not only about software engineering and building products, but also about a whole other field of Biology and the R&D industry. Also, to be honest, Benchling’s logo was just too beautiful for me to pass up.

What I built this summer

Say I’m a lab scientist who’s been testing an antibody for COVID-19. I have a human antibody that seems promising for patients, and I want to scale up production by making that same antibody in hamster cells. If I just use the DNA that encodes my human antibody, I might get a low yield in my hamster cells, since different organisms favor different codons (three-letter patterns of DNA bases) for the same amino acid. Instead, I can swap out certain codons in my DNA for synonymous ones and improve the expression of my protein in the hamster cells (read more about codon optimization here!). Back translation is another way to approach this: scientists often start with the protein they want, and then determine what DNA they need to produce that protein. Scientists use codon optimization and back translation for a variety of use cases, which makes them super popular products in Benchling (my teammate Karen implemented the initial versions of them when she was an intern, and wrote about them here).
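To make the idea concrete, here is a minimal sketch of back translation that picks the most frequent codon for each amino acid. It is not Benchling’s implementation: the `CODON_USAGE` table, its frequencies, and the `back_translate` function are hypothetical and only cover a handful of residues, whereas a real tool would use a full, organism-specific codon usage table and additional scientific parameters.

```python
# Hypothetical, heavily truncated codon usage table: residue -> {codon: frequency}.
# The frequencies are illustrative placeholders, not measurements for any organism.
CODON_USAGE = {
    "M": {"ATG": 1.00},
    "K": {"AAG": 0.60, "AAA": 0.40},
    "L": {"CTG": 0.45, "CTC": 0.20, "TTG": 0.15, "CTT": 0.10, "TTA": 0.05, "CTA": 0.05},
    "E": {"GAG": 0.58, "GAA": 0.42},
    "*": {"TGA": 0.50, "TAA": 0.28, "TAG": 0.22},  # stop codons
}


def back_translate(protein: str, usage: dict = CODON_USAGE) -> str:
    """Return DNA encoding `protein`, using the most frequent codon per residue."""
    codons = []
    for residue in protein.upper():
        if residue not in usage:
            raise ValueError(f"no codons known for residue {residue!r}")
        options = usage[residue]
        # Choose the synonymous codon with the highest frequency in the table.
        codons.append(max(options, key=options.get))
    return "".join(codons)


print(back_translate("MKLE*"))  # ATGAAGCTGGAGTGA
```

Always taking the single most frequent codon is the simplest possible strategy. It illustrates why the result depends on the target organism: swapping in a different usage table changes the DNA while the encoded protein stays the same.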

Although frequently used, codon optimization and back translation were both implemented only as highlight-and-right-click tools in Benchling. Before my summer project, scientists would individually select the regions of each protein that they wanted to optimize, and wait until the process was done before moving on to the next protein. One of our clients had manually performed thousands of optimizations in the few months before my internship started: that’s thousands of clicks and tons of research time wasted! Speeding up this process was heavily requested by some of our largest industry clients and by many more scientists in academia. My intern project was to build out bulk back translation, so that scientists can click a few buttons and optimize hundreds of proteins all at once! This was a common theme at Benchling: engineers across teams do customer-driven, impactful work in pursuit of Benchling’s mission of accelerating the pace of the R&D industry.

I spent my summer implementing and parallelizing bulk back translation as an option on our search view. Scientists can now select all the proteins they want to optimize, and fill out a modal with various scientific parameters. Then, once they hit “Submit”, the proteins are optimized in bulk, and DNA is automatically saved right into a specified folder, all with only a few button clicks.

Screenshot of configuring a bulk back translation job
Scientists can select proteins and input codon optimization parameters into the modal, and optimized DNA sequences will automatically be saved, all at once.

We quickly realized that a synchronous approach, back translating each selected protein one after the other, was not fast enough for our clients’ use cases. Luckily, this problem was “embarrassingly parallel”. Since individual proteins don’t need context about each other, each optimization task can be done independently, and we can summarize the results for all the proteins once the parallel tasks have completed. To support this, I used Celery, a great Python library for task queueing and parallelization. Specifically, I packaged my tasks as chords, which group a bunch of individual tasks and run a callback task to reduce over the results. With this parallel approach, optimizing 100 proteins with 500 amino acids each takes 45 seconds, compared to around 10 minutes if done synchronously (roughly a 13x speedup)! In the next few months, we’ll be moving the feature over to AWS Lambda to increase the volume of optimizations we can support.
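The fan-out-then-reduce shape described above can be sketched with Python’s standard library alone. This is a stand-in, not the Celery code from the project: it uses `concurrent.futures.ThreadPoolExecutor` instead of Celery workers so it runs without a broker, and `optimize_protein`, `summarize`, and `bulk_optimize` are hypothetical names with placeholder logic. In Celery, the same structure would be a chord of the form `chord(header_tasks)(callback_task)`.

```python
from concurrent.futures import ThreadPoolExecutor


def optimize_protein(protein: str) -> dict:
    """Hypothetical stand-in for one independent back translation task."""
    # Placeholder work: a real task would run codon optimization and save DNA.
    return {"protein": protein, "dna_length": 3 * len(protein)}


def summarize(results: list) -> dict:
    """Callback step: reduce over the per-protein results once all are done."""
    return {
        "count": len(results),
        "total_bases": sum(r["dna_length"] for r in results),
    }


def bulk_optimize(proteins: list, max_workers: int = 8) -> dict:
    # Each protein is independent, so the tasks can run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(optimize_protein, proteins))  # keeps input order
    # The reduce step only runs after every task has finished.
    return summarize(results)


print(bulk_optimize(["MKLE", "MEEK", "MLLK"]))  # {'count': 3, 'total_bases': 36}
```

The key property is that no task reads another task’s output; only the final callback sees all the results, which is what makes the problem embarrassingly parallel.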

Diagram of parallelizing codon optimization in bulk
To improve performance, I parallelized individual codon optimization tasks, and reduced over the results with a callback task. I used Celery chords to support this model.

Hackathon

Benchling hosts an annual hackathon, where people from all across the company (not just engineers!) spend a couple of days building and demoing a product from scratch.

I love hackathons and am on the organizing board for my school’s annual competition, so I was excited to participate! My team, “Benchine Learning”, built a proof-of-concept for analyzing data in Benchling with powerful machine learning models. Users store an incredible amount of research data in Benchling, and we want to let them analyze this data and make predictions from it so they can make informed decisions. My team created an easy-to-use interface for comparing multiple machine learning models, integrated right into Benchling’s Insights application. I added support for a few simple models using scikit-learn, a Python library for machine learning. We demoed our product to the company with a breast cancer dataset, and showed off the high prediction accuracy we got with our models!

Working at Benchling

Going into the summer, the only context I had in Biology was from my 9th grade intro class. I immediately saw that what Benchling engineers shared wasn’t necessarily a background in Biology (only a few of the engineers I worked with had a life sciences background). Instead, we shared a drive to continuously solve problems and learn new things, whether or not they’d be immediately applicable to the code we were writing. With that came a high level of collaboration between engineers. I had a ton of questions and faced many difficult problems throughout the summer, and everyone was always open to pairing up and working through code together. Adjusting to a remote internship where I couldn’t walk over and ask my teammates questions was not easy. That said, my team organized different ways to stay connected. We had daily tea times to wind down and chat, and weekly sessions to share context and learnings. I had never thought about the surprising complexity of calculating molecular weights, or the dangers of running massive PostgreSQL backfills! With these chats, I felt a sense of collaboration that I was pleasantly surprised to see persist in a virtual environment.

If you’re interested in interning or working full-time at Benchling, we’re hiring!

Thanks to Chanel, Naomi, and Somak for reading drafts of this.


Originally published in Benchling Engineering on Medium.

Source: Benchling