Benchling’s double-writes approach to incrementally adopting GraphQL

Benchling is a platform for life sciences R&D. What enables Benchling to provide a true platform for biological research is the data model: its customizability, extensibility, and interconnectedness. Biological data is inherently difficult to model given its complex nature. For a given experiment, a scientist may need to track not just experimental results associated with the sample they’re analyzing, but also its ancestry and lineage, downstream experiments, and even physical storage location. On top of these needs, Benchling also aims to support customers in a wide variety of scientific verticals, from cancer therapeutics to food science to materials and more, and as such, we need to provide a data model that can be tailored to specific kinds of research.

This customizable data model presents us with our fair share of technical challenges. Our backend API was originally a REST-like backend-for-frontend. This means that while most of our endpoints are REST verbs oriented around a single model type, we sometimes also include nested objects in the response for convenience or performance. As a result, when users request simple tweaks like “could we also show this related data here?”, engineers suddenly have much more work on their hands. Is that data already served by one of the endpoints I’m calling? Could I just add it, or would that break its performance contract? Should I create a new custom endpoint for all of this view’s data, or chain a preexisting endpoint? Is this tweak even worth the effort?

These questions ultimately distract from what we aim to do: build a product that scientists love to use. To make these changes as simple as they’re intended to be, we decided to adopt GraphQL and Apollo, which have proven to be much better suited to consuming our complex data model than React-Redux or our homegrown frameworks. Incremental adoption was challenging, in part because of the complexity of our data model, and in part due to other constraints detailed below.

In this blog post, we’ll describe how we adopted GraphQL. We’ll first dig into the considerations that made GraphQL both appealing to us and tricky to adopt. Then, we’ll dive into the system we built, a behind-the-scenes double writer that mirrors writes between our Apollo cache and Redux, and ultimately narrows a developer’s field of vision to a single React component at a time. Finally, we’ll share some key takeaways and learnings from this process.

Why GraphQL?

Let’s take an example: the Notebook, a virtual representation of a scientist’s paper lab notebook, similar to note-taking tools like Google Docs. The Notebook is a scientist’s main entry point into Benchling. A single Notebook entry is a place for recording both qualitative and quantitative data associated with a given experimental flow. Within a Notebook entry, a scientist might record data against many biological samples that are contained within the same test tube or 96-well plate.

An example Notebook entry in Benchling

Notebook entries can also be reviewed by collaborators, similar to a code review. In Benchling, we need to model not just the data captured within the Notebook, but also information about the collaborator review. In REST, a Notebook entry and its associated review information would probably be served by separate endpoints, but in GraphQL, we can easily request any additional fields we need:

https://medium.com/media/6a84436fa16dc3dc39196d88c519206f/href
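As a hypothetical illustration (the field and type names below are assumptions, not Benchling’s actual schema), a single query for a Notebook entry together with its review information might look roughly like this, with the expected result shape written out as TypeScript types:

```typescript
// Hypothetical field names; the real Benchling schema is not reproduced here.
const NOTEBOOK_ENTRY_QUERY = `
  query NotebookEntry($id: ID!) {
    notebookEntry(id: $id) {
      id
      name
      # Review data is requested alongside the entry, in the same round trip.
      reviewSummary {
        id
        status
        comments {
          id
          body
        }
      }
    }
  }
`;

// The shape of the data this query would return.
interface ReviewComment {
  id: string;
  body: string;
}

interface ReviewSummary {
  id: string;
  status: string;
  comments: ReviewComment[];
}

interface NotebookEntry {
  id: string;
  name: string;
  reviewSummary: ReviewSummary | null;
}

interface NotebookEntryQueryResult {
  notebookEntry: NotebookEntry | null;
}
```

Showing one more piece of review data in the UI then means adding a field to the query, rather than deciding which endpoint should serve it.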

It’s important to recognize that much of GraphQL’s value comes from a properly designed schema. If, instead, we autogenerated the schema to match our REST API, we wouldn’t see the same value:

https://medium.com/media/3fa38d53016cc20e03493bd3036715b2/href

Constraints for the migration

When evaluating a new core technology, it’s tempting to imagine your stack with all of your legacy code magically replaced. Of course, the reality is that feature development won’t pause for you to do that, and you’ll not only need to incrementally adopt the technology, but also be sure that the in-between state is acceptable for a long period of time. In our case, the same data interconnectedness that makes GraphQL a good fit for us also made it challenging to adopt. To be sure we made the right decision, we set the following constraints on our migration.

  1. Maintain the same user experience throughout our migration. This ruled out approaches that involved fetching the same data through both Redux and GraphQL at the same time, because that would likely degrade performance. It also meant that if we did store the same data in both Redux and our Apollo cache, that data must stay in sync. Otherwise users could easily run into situations in which the same data had different values in different parts of the app. As you can imagine, scientists get quite nervous when their data behaves unpredictably.
  2. Avoid making any React components more complex. Because our ultimate goal for GraphQL is developer velocity, we wanted to avoid introducing new complications. This ruled out using both GraphQL and Redux in the same React component, because fetched data is often dependent on other fetched data, and stringing together dependent fetches across multiple frameworks sounded like a headache.
  3. The GraphQL schema should be properly designed and free of baggage. Besides a few important tactical benefits like declarative data fetching, most of GraphQL’s value comes from a properly designed schema. The GraphQL schema is meant to be an artifact that describes how we think about our product today, and is able to evolve to match future use-cases. As mentioned above, approaches that autogenerate the schema would likely lead to long-term complications we wanted to avoid.
  4. We wanted to see value quickly. If it would take years for a majority of the team to use GraphQL, we’d have been better off improving our prior stack. This is tricky in part due to the constraints above, and in part because of that data interconnectedness. As a scientific data platform, even totally new products still generally need data from other parts of our app, so GraphQL wouldn’t be nearly as useful without backporting most of our preexisting data.

The Double-writes system

The approach we landed on was a system that turns any write to either our Apollo cache or Redux store into a write to both. This way, preexisting Redux-based components automatically populate the Apollo cache and new or ported GraphQL components do the same for Redux. Ultimately, this means that instead of worrying about potential breakages across our app, developers focus only on components they’re touching.
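To make the idea concrete, here is a minimal sketch of mirroring writes between two stores. It is not the actual implementation: `TinyStore` and `linkStores` are hypothetical stand-ins for the Redux store, the Apollo cache, and the syncing layer, and the sketch ignores translating between data shapes.

```typescript
type Listener<T> = (key: string, value: T) => void;

// A tiny key-value store standing in for either the Redux store or the Apollo cache.
class TinyStore<T> {
  private data: Record<string, T> = {};
  private listeners: Listener<T>[] = [];

  read(key: string): T | undefined {
    return this.data[key];
  }

  write(key: string, value: T): void {
    // Skip no-op writes so mirrored writes cannot bounce back and forth forever.
    if (key in this.data && this.data[key] === value) return;
    this.data[key] = value;
    for (const listener of this.listeners) listener(key, value);
  }

  subscribe(listener: Listener<T>): void {
    this.listeners.push(listener);
  }
}

// Turn a write to either store into a write to both.
function linkStores<T>(a: TinyStore<T>, b: TinyStore<T>): void {
  a.subscribe((key, value) => b.write(key, value));
  b.subscribe((key, value) => a.write(key, value));
}
```

The no-op check in `write` is what stops the mirroring from looping: the echoed write finds the value already present and returns without notifying anyone.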

In the abstract, keeping two caches in sync is hard to get correct. Frontend state management is notoriously complex due to the number of possible states and interactions. A frontend developer needs to consider whether data is loaded, which fields are relevant, and how data is interconnected (implicitly or explicitly). These concerns compound when considering that any number of components on the page may use or modify that data at any time. We were able to simplify the problem with two observations about our app in particular:

  • Most models only have one or two frontend representations, and they’re stored separately from each other. This is perhaps the key insight — mapping each representation to a GraphQL fragment (a reusable query chunk) reduces the state space down to “which fragments by ID has the store loaded?”
  • Only the component that triggered a fetch tends to care whether its data is loading, and only so that re-rendering it doesn’t trigger further fetches while the first request is in flight. This means we don’t additionally need to globally store which fragments by ID are loading.

Given these, we only need one goal: for each React render tick, the Redux and GraphQL state must be consistent. State is consistent if the same data is stored in both data caches for each GraphQL fragment and REST pair. Notably, when a model had multiple representations, we didn’t need to keep those in sync, because our prior app didn’t either. As we mentioned above, our goal was simply to maintain the user experience, and not get too hung up on fixing every quirk.
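One way to pin down this definition is as a check that could run in a test or a development build. The sketch below is an assumption about how such a check might look, not something described in this post: `SyncedPair`, `deepEqual`, and `findInconsistencies` are hypothetical names, and both readers are assumed to return data already translated into a common shape.

```typescript
// One fragment/REST pair: how to read the same entity from each cache.
interface SyncedPair<T> {
  fragmentName: string;
  ids: string[];
  readFromApollo: (id: string) => T | undefined;
  readFromRedux: (id: string) => T | undefined;
}

// Structural equality for plain JSON-like data.
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const aRec = a as Record<string, unknown>;
  const bRec = b as Record<string, unknown>;
  const aKeys = Object.keys(aRec);
  if (aKeys.length !== Object.keys(bRec).length) return false;
  return aKeys.every((key) => key in bRec && deepEqual(aRec[key], bRec[key]));
}

// Returns "fragmentName:id" for every entity whose two copies disagree.
function findInconsistencies<T>(pairs: SyncedPair<T>[]): string[] {
  const mismatches: string[] = [];
  for (const pair of pairs) {
    for (const id of pair.ids) {
      if (!deepEqual(pair.readFromApollo(id), pair.readFromRedux(id))) {
        mismatches.push(`${pair.fragmentName}:${id}`);
      }
    }
  }
  return mismatches;
}
```

An empty result at the end of a render tick corresponds to the consistent state described above.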

The rest of this section dives into how our double-writes system works. We start with a system overview, then highlight two key pieces of the design. Finally, we’ll walk through the two critical paths of data through the system (a Redux component rendering a GraphQL component, and vice versa) to illustrate the system in practice.

System overview

Double-writes system overview

At the core of this system are two key modules: FragmentWatcherLink and a number of Syncers. FragmentWatcherLink, an Apollo Link (plugin), listens at the network level for fragments (reusable query chunks) from the server. It then forwards those fragments to their associated Syncers. Each Syncer maps a specific fragment to a specific piece of Redux data and defines how to read from and update Redux.

BenchlingQuery is our wrapper around Apollo’s Query API. Mostly this overrides a few defaults, but it also plays an important role in blocking rendering until Redux has settled. We’ll dig more into that in the GraphQL → Redux data flow below.

FragmentWatcherLink

FragmentWatcherLink listens for completed queries at the network layer and extracts all fragment-data pairs from the response data, including nested ones (a fragment may be composed of other fragments). This is a surprisingly intricate problem, and you can see the implementation in its entirety here.

For example, consider the following simple GraphQL query and result. (Note that a REST endpoint exists which returns similar data in a slightly different shape.) Each of the fragments — Notebook, ReviewSummary, and ReviewComment — has its own place in the Redux store. FragmentWatcherLink’s job is to extract objects for each fragment, and forward them to the associated Syncer.

https://medium.com/media/10109b3b0e5b013126427ef887575d0f/href

And the response:

https://medium.com/media/8bdfd13fff488b173a5335e204e45f1d/href

Here, we extract four objects, to be forwarded to three Syncers (one Syncer per fragment):

  • Notebook with ID note_74230
  • ReviewSummary with ID rvw_13298
  • ReviewComment with ID cmt_76495
  • ReviewComment with ID cmt_98735
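The extraction step can be sketched as a recursive walk over the response. This is a deliberately simplified illustration: it assumes every synced fragment covers exactly one GraphQL type, so an object’s `__typename` is enough to identify its fragment, whereas the real FragmentWatcherLink has to handle fragments as they appear in the query. The response fields other than the IDs listed above are made up.

```typescript
// Assumption for this sketch: one fragment per GraphQL type, keyed by __typename.
const FRAGMENT_BY_TYPENAME: Record<string, string> = {
  Notebook: "Notebook",
  ReviewSummary: "ReviewSummary",
  ReviewComment: "ReviewComment",
};

interface ExtractedFragment {
  fragmentName: string;
  id: string;
  data: Record<string, unknown>;
}

// Recursively walk response data and collect every object that matches a fragment,
// including objects nested inside other fragments.
function extractFragments(value: unknown, found: ExtractedFragment[] = []): ExtractedFragment[] {
  if (Array.isArray(value)) {
    for (const item of value) extractFragments(item, found);
  } else if (typeof value === "object" && value !== null) {
    const obj = value as Record<string, unknown>;
    const typename = obj["__typename"];
    const id = obj["id"];
    if (typeof typename === "string" && typeof id === "string") {
      const fragmentName = FRAGMENT_BY_TYPENAME[typename];
      if (typeof fragmentName === "string") {
        found.push({ fragmentName, id, data: obj });
      }
    }
    for (const key of Object.keys(obj)) extractFragments(obj[key], found);
  }
  return found;
}

// Hypothetical response; only the IDs come from the example above.
const sampleResponse = {
  notebook: {
    __typename: "Notebook",
    id: "note_74230",
    reviewSummary: {
      __typename: "ReviewSummary",
      id: "rvw_13298",
      comments: [
        { __typename: "ReviewComment", id: "cmt_76495" },
        { __typename: "ReviewComment", id: "cmt_98735" },
      ],
    },
  },
};

const extracted = extractFragments(sampleResponse);
// extracted holds four entries: the Notebook, the ReviewSummary, and both ReviewComments.
```

Each extracted entry would then be handed to the Syncer registered for its `fragmentName`.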

Listening at the network layer leaves an important gap, however: because we only listen to fragments from the network, we don’t capture other changes to the Apollo cache, like direct writes. We took that approach because Apollo 2’s cache doesn’t have a way to listen to fragments by ID. Once we’ve upgraded to Apollo 3, we’ll investigate whether we can replace FragmentWatcherLink with something built into the Apollo cache.

Syncers

The Syncer is the intermediary between the Apollo cache and Redux. For each fragment to be synced, we define how to extract the corresponding data from Redux, and how to dispatch the corresponding action to update Redux.

Below is a simple example for our Notebook type. This is part of a larger map from fragment name to a type-safe Syncer for that model.

https://medium.com/media/84c81e3e77c6e44215f16249dab9f20d/href
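For readers without access to the embedded snippet, a Syncer of this kind might have roughly the following shape. Everything here is a hypothetical sketch: the `Syncer` interface, the Redux state layout, the action type, and the REST field names are assumptions made for illustration.

```typescript
// Hypothetical shapes; the real Redux state, actions, and fragment types differ.
interface NotebookFragment {
  id: string;
  name: string;
}

interface RestNotebook {
  id: string;
  display_name: string;
}

interface ReduxState {
  notebooks: Record<string, RestNotebook>;
}

type ReduxAction = { type: "NOTEBOOKS_BULK_UPDATE"; notebooks: RestNotebook[] };

// A Syncer maps one fragment to one piece of Redux data.
interface Syncer<Fragment> {
  // Read the Redux copy of an entity, translated into the fragment's shape.
  readFromRedux(state: ReduxState, id: string): Fragment | undefined;
  // Build the action that writes fragment data back into Redux.
  toReduxAction(fragments: Fragment[]): ReduxAction;
}

const notebookSyncer: Syncer<NotebookFragment> = {
  readFromRedux(state, id) {
    const rest = state.notebooks[id];
    return rest ? { id: rest.id, name: rest.display_name } : undefined;
  },
  toReduxAction(fragments) {
    return {
      type: "NOTEBOOKS_BULK_UPDATE",
      notebooks: fragments.map((f) => ({ id: f.id, display_name: f.name })),
    };
  },
};

// Part of a larger map from fragment name to the Syncer for that model.
const syncers = { Notebook: notebookSyncer };
```

Keeping both directions of the translation in one object means the shape differences between the GraphQL and REST representations live in a single place.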

This part of the system is where much of the complexity lies, and that’s by design. Mapping one production data store to another has a ton of nitty-gritty intricacies, and our goal was to hide that complexity from the React components. In fact, translating from GraphQL-shaped data to REST/Redux-shaped data is an involved but well-enough scoped problem that we built a separate system to handle that. That’s a topic for its own blog post, so let us know if you’d like to see that!

Data flow

Now that we understand the key pieces of the system, let’s look at how data flows through it. Below, we’ll consider the two critical data flows.

  1. A Redux component fetches data and then renders a component that reads that data from the Apollo cache
  2. A GraphQL component fetches data and then renders a component that reads that data from Redux

Redux → GraphQL

Starting with the simpler flow, consider two React components:

  • ContainerComponentRedux: fetches data and stores it in Redux, then renders PresentationalComponentGraphQL
  • PresentationalComponentGraphQL: a component that previously read from Redux without fetching, and has now been migrated to GraphQL. Because the parent component populates the Redux store, and Syncers forward that data, this component should be able to read from Apollo’s cache without using the network.

Note that although the presentational component shouldn’t do network fetches, we also don’t want to use Apollo’s barebones cache API (e.g. readQuery). While directly using Apollo cache reads and writes would be a more exact refactor, changing data needs down the line would require a second refactor for Apollo to manage the fetching as well.

This diagram shows how data flows through the system:

As an optimization, when we previously fetched data with Redux, we always dispatched a single bulk update for each model type. This means that whenever a Syncer detects a change in Redux, it can naively grab the data, translate it, and write it to Apollo. The GraphQL → Redux flow, on the other hand, is slightly more complex because we don’t have a similar grouping mechanism.
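A simplified sketch of that Redux → GraphQL step follows. It relies only on the `getState` and `subscribe` methods that a Redux store exposes; the state shape, the `FragmentCache` interface, and `syncNotebooksToApollo` are hypothetical, and a real version would write into the Apollo cache (for example with Apollo Client’s `writeFragment`) rather than into a stand-in.

```typescript
// Hypothetical shapes standing in for the real models.
interface RestNotebook {
  id: string;
  display_name: string;
}

interface NotebookFragment {
  id: string;
  name: string;
}

interface ReduxState {
  notebooks: Record<string, RestNotebook>;
}

// The subset of the Redux store API this sketch needs.
interface StoreLike {
  getState(): ReduxState;
  subscribe(listener: () => void): () => void;
}

// Stand-in for the Apollo cache.
interface FragmentCache {
  writeFragment(fragmentName: string, id: string, data: NotebookFragment): void;
}

// Watches the notebooks slice and forwards it to the cache whenever it changes.
// Returns the unsubscribe function from the store.
function syncNotebooksToApollo(store: StoreLike, cache: FragmentCache): () => void {
  let previous = store.getState().notebooks;
  return store.subscribe(() => {
    const current = store.getState().notebooks;
    // Reducers replace the slice when it changes, so a reference check is enough.
    if (current === previous) return;
    previous = current;
    // Naive by design: grab the whole slice, translate it, and write it out.
    for (const id of Object.keys(current)) {
      const rest = current[id];
      cache.writeFragment("Notebook", id, { id: rest.id, name: rest.display_name });
    }
  });
}
```

Rewriting the whole slice is affordable here only because of the single bulk update per model type; with many small dispatches it would repeat the same work on every one.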

GraphQL → Redux

For the more complex flow, we again consider two components:

  • ContainerComponentGraphQL: now uses Apollo to fetch a GraphQL query and renders PresentationalComponentRedux.
  • PresentationalComponentRedux: reads from Redux without any network requests, and hasn’t been migrated to GraphQL yet.

In this flow, we additionally need to debounce updates for performance. Apollo calls FragmentWatcherLink once per query, and we often fire many queries per network request (see query batching). Without debouncing, a single network request could trigger many updates, meaning many re-renders, which would dramatically degrade our app’s UX.
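The sketch below shows one way to collapse many per-query callbacks into a single Redux update. Strictly speaking it is batching rather than a timer-resetting debounce: the first update schedules one flush, and everything that arrives before the flush runs is delivered together. `createUpdateBatcher` and `FragmentUpdate` are hypothetical names, and the scheduler is injected so the example does not assume any particular timer or framework.

```typescript
interface FragmentUpdate {
  fragmentName: string;
  id: string;
  data: unknown;
}

// `schedule` defers a callback (for example to the next tick or animation frame).
// `flush` receives every update queued since the previous flush.
function createUpdateBatcher(
  schedule: (callback: () => void) => void,
  flush: (updates: FragmentUpdate[]) => void,
): (update: FragmentUpdate) => void {
  let pending: FragmentUpdate[] = [];
  let scheduled = false;

  return (update) => {
    pending.push(update);
    if (scheduled) return; // A flush is already queued; just accumulate.
    scheduled = true;
    schedule(() => {
      const batch = pending;
      pending = [];
      scheduled = false;
      flush(batch);
    });
  };
}
```

In an app, `schedule` could be something like `(cb) => setTimeout(cb, 0)` and `flush` could group the updates by fragment and dispatch one Redux action per Syncer, so a batched network response causes one round of re-renders instead of one per query.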

Once these two flows work, we can be sure other render sequences will also work. For example, these flows support a use case where one component fetches Redux data, and then much later another component happens to request that data via GraphQL. Trivially, the flows also support two components that both fetch Redux data. The one exception is when two components request the same data from Redux and GraphQL at the exact same time — given how rare this is in our app, we decided it was acceptable to fire a duplicate network request, and instead prioritize migrating those components to GraphQL.

Takeaways

So where are we today?

  • ~1/3 of our 1M-LOC (lines of code) frontend uses GraphQL, including multiple products and expansions that are pure GraphQL.
  • ~3/4 of our GraphQL queries rely on data that existed at the time of adoption. We wouldn’t have had nearly as much usage if these queries had required fully porting Redux first.

Naturally, a migration this complex has plenty of learnings. Here are the most valuable lessons we took away:

  • The double-writes approach perhaps worked a little too well, since we haven’t made a lot of progress removing Redux. In other approaches, teams usually have natural incentives to remove that type of tech debt, but in this approach, the complexity is mostly abstracted away for them. This has meant we’ve needed other non-technical ways to encourage cleanup.
  • Because the abstraction has worked so well in many cases, we neglected failure states, leading to opaque and confusing error messages. At the end of the day, it’s meant to be used by humans, and prioritizing developer experience in the “unhappy path” would have empowered devs to fix their own bugs more often.
  • Relatedly, we should have leaned more on linting and other automated checks. For example, lint rules could let a developer know that they need to implement a Syncer, or that they need to query for a specific data shape to trigger syncing. Instead, we’ve needed to catch these issues in code review, which can be time-consuming for us and frustrating for them.

Innovative technologies like GraphQL can create step-function increases in developer velocity, but in the real world this is usually contingent on finding a practical adoption path. In our case, adoption was tricky due to the constraints we set: preventing regressions in user experience and developer velocity, ensuring our schema is designed for long-term success, and seeing value quickly.

We were ultimately successful in meeting our constraints thanks to our double-writes system, one that keeps Redux and our Apollo cache in sync with each network request. We map our Redux state to GraphQL fragments when Redux changes, and similarly fire Redux actions when a GraphQL query completes. This encapsulates the complexity of translating one cache to another within the syncing layer, which allows developers to write natural GraphQL components.

We’re hiring!

If these kinds of technical challenges appeal to you, we’re hiring! We recently closed a $200M Series E to continue expanding our product and accelerating life sciences R&D.


Benchling’s double-writes approach to incrementally adopting GraphQL was originally published in Benchling Engineering on Medium.

Source: Benchling
