Migrating Segment Efforts Storage at Strava.
Earlier this month the Strava Foundation Engineering team completed a multi-year project to handle the rollover of activity IDs into a 64 bit integer format. This work culminated with a final year-long project to move read and write access of segment effort data from our monolithic Rails app to behind a service, a necessary prerequisite to allow us to safely and easily to change the underlying segment efforts storage system to be compatible with 64 bit activity ID values.
This blog post details the work which went into the segment efforts migration, a multidisciplinary project requiring engineers to design and develop a brand new high-volume service, and then patch it into many crucial product workflows.
As activities are created by our athletes, we assign a monotonically increasing (by one) identifier value to each activity. This integer ID uniquely identifies the activity in our data stores and appears as a key (both primary and foreign) in many tables throughout our data systems.
While most storage locations of the activity ID value use a 64 bit int column type, older systems used an unsigned 32 bit int column type. As it became obvious that the activity ID value would exceed the maximum value of that 32 bit column type (4,294,967,295) in the not-so-distant future, we realized if we did nothing, inserts with IDs greater than the max would fail and Strava would grind to a halt. Our best guess was this rollover would occur sometime in early 2021.
To stave off this catastrophe, we needed to change the schema of all tables with a 32 bit activity ID column to the 64 bit type. Throughout most of early 2019 we successfully performed many of these migrations. But in July 2019 when we attempted to migrate our largest table, segment_efforts, those migrations failed due to the sheer size of the data. segment_efforts holds a single row per unique segment traversal at Strava, and at the time of our attempted migration, the table held ~70 billion rows. And on a typical busy weekend day we were adding well over 100 million rows.
Undeterred, we tried a handful of different online schema migration tools and techniques to do the migration, but all of them failed. Generally, most schema migration tools create a full copy of the table with the new desired schema, and use some sort of asynchronous process to backfill and then stream live updates into the target table, before an atomic transaction to cut over to the new table is made.
segment_efforts is so large that we literally did not have capacity on our host to do the full copy. And because the database was hosted via AWS Aurora, we were limited by the max instance size available. Adding to our pain was the fact AWS charges for I/O on Aurora (even within the same availability zone), so each failed migration attempt resulted in a real monetary cost to the business.
Unable to migrate the column type, we also looked at creative SQL-based solutions to do the “migration.” These included creating a new table with the correct schema, and then trying to UNION them in SQL with a view. We also tried adding a new NULL activity_id_2 column with a plan to migrate data into it gradually. Neither solution worked.
New Data Store
The goal with our migration was to do it “in-place,” keeping the same schema for clients who are currently interacting with the table. This is preferable, as it requires no code changes from clients to continue to access it after the migration in completed. Unfortunately, by late August 2019, it became apparent an in-place migration of the existing segment_efforts table was impossible, as we’d have to figure something else out.
The only real option at this stage was to create a new data store for segment efforts that handled 64 bit activity_id values. Once deployed, we could start a long and gradual process to migrate traffic to it, eventually having the new data store take over all responsibility and deprecating the old store.
We also knew we had the option of working with a data store which was not MySQL. While MySQL had worked well for us so far, it did have some downsides — namely a poor primary failover story, and the inability to do schema migrations at such a large size (the whole reason we are in this mess to begin with).
With this in mind, we began a long research process to see which potential data stores met the needs of our application. We set up some criteria for our evaluation we wanted with a data store:
- High query concurrency. segment_efforts sees peaks at ~6,000 QPS, with a 2:1 read to write ratio . The data store must be able to support this scale.
- Queryability. While the majority of queries fall into known patterns, there is a long tail of query patterns made by clients. We must be able to support all queries, though we may be okay with higher latencies for the less popular ones.
- Online schema migrations. The whole reason we’re in this pickle is that MySQL does not support online schema migrations, so it would be great to find a data store that did that.
- Easy and predictable scalability. As our traffic grows, our database needs to be able to easily scale to meet that traffic. Ideally, adding more nodes increases performance, and that increase is predictable and easy to do.
- Disaster recovery. While segment effort data is technically replaceable, doing so en masse would be a herculean effort of time and resources. Consequently, if we experience some sort of catastrophic data loss or other problem, our data store must be able to be recovered with data in a complete or near-complete state.
- High availability. Creation of segment efforts is a critical workflow at Strava, and as such the database must maintain very high availability. Today, a failover of our MySQL database for segment efforts, albeit infrequent, results in a 5–10 minute outage of requests while a standby replica is promoted. It would be great to use a more available data store.
We evaluated many potential data stores. Some were more OLAP-focused (e.g, Clickhouse), some were distributed SQL-like/NoSQL databases (e.g., Cassandra, DynamoDB), some were distributed SQL databases (e.g., CockroachDB), and some were more savvy uses of SQL (e.g., Vitess). Evaluations and proofs of concept for these data stores took time. For candidates which seemed most promising, we wanted to deploy them into production to test them against actual production workload.
This, however, presented a problem. All current access to segment efforts was direct SQL access via MySQL clients to the segment_efforts table. If we wanted to test client traffic against a data store that did not present a MySQL-compatible client interface, clients would need to refactor their code to work with the new data store.
Furthermore, the majority of segment efforts access was from our monolith Rails application, which had accumulated over 10 years of code presuming direct SQL access to the table. We estimated it would take at least 3–6 months of work to refactor all of its code to work with a new segment efforts data store. We did not think it was a good use of resources to do that kind of refactor more than once.
Segment Effort Service
Given all this uncertainty, we made a strategic decision to do the following steps:
- Front all segment effort access with a service. Migrate all clients to it.
- Choose a new segment efforts data store.
- Programmatically “bond” both the old and new stores via the service.
By moving segment efforts access behind a service, we can have that service present a single unified interface which clients use to read or write segment efforts. As long as that interface is kept stable to outside clients, we’re free to change the underlying data store implementation easily, without coordinating that change with those clients.
Design & Development
The segment efforts service follows a familiar pattern of services at Strava. The service defines a Thrift service interface, which clients use to read segment efforts. Designing this interface was a little bit challenging, as we needed to give a rich interface to allow clients to replicate their existing SQL queries, but not so rich as to find ourselves reimplementing the SQL syntax in Thrift.
To help inform the interface, we leveraged some internal tooling to profile SQL traffic. From this analysis, we were able to see what our query patterns were, and thus optimize the interface around those access patterns. Most of our query volume was able to fit into a relatively simple interface. For the queries that did not initially fit that interface, we were able to refactor the application code and then leverage some in-memory filtering or sorting to still achieve the desired results. This may have been less efficient than leveraging the database to do that same sorting and filtering, but these queries were infrequent enough that it was worth the tradeoff.
Writes of segment efforts use an event sourcing model, where all pending write operations are enqueued by clients as commands into a Kafka topic for a worker to process. This centralization of all writes reduces concerns over race conditions which can occur during concurrent writes by distributed actors. With this coordination, the worker executing those writes is able to itself log correctly ordered messages of the writes it is making. These events are further consumed by other downstream services, e.g., the service which updates local legends data.
As parts of the service were ready for use, we started to refactor existing client access to use it. With our monolithic Rails application, this was incredibly challenging as it had many nests of post-save callbacks, metaprogramming, entanglement with other business objects, existing foreign key relationships, and other hidden or bespoke behavior. Porting all this access to Thrift calls took a ton of effort, taking 1–4 engineers working (mostly part-time) nearly 6 months.
Towards the end of service development, we turned back to evaluating potential new data stores. By this time, it was spring of 2020 and a lot was going on. With Coronavirus and the worldwide shift to sheltering in place at home, Strava saw a sudden and large increase in the number of activities uploaded. This increase in workload stressed quite a few internal systems, requiring us to shift resources to deal with those performance issues. Cruelly, the increase in the number of uploaded activities accelerated the timeline to the 32 bit activity_id rollover, with the estimated rollover occurring sometime in October 2020, approximately five months earlier than before.
The full evaluation and decision-making process for choosing a data store is outside the scope of this blog post, but the short story is this: no data store checked all of our boxes. We found some stores offered amazing scalability and queryability, but could not handle our write load (mostly DELETEs) without sacrificing availability. Stores which did the opposite — handle the write load and maintain availability — generally suffered from poor queryability.
It was at this point where we had a decision to make. We had, at best, five months of runway left on our activity_id column before the rollover happened. Yet no alternative data store really was meeting all our requirements.
This fact is not really that surprising. Like most things in life, data stores are all about tradeoffs. If there was one data store that did everything well without tradeoffs, everyone would be using it. So in reality, choosing a data store is about understanding and managing your tradeoffs. We needed to really understand not just what characteristics we wanted in a data store, but also how important each of those characteristics were to us.
It’s all About Tradeoffs
With this in mind, we went back to our original list of requirements and thought about how important each one was to Strava. We discovered a few interesting things.
First, while the occasional MySQL failover is unfortunate, it’s not really a huge pain point for us. Generally we see maybe 1–2 failovers per year, the outcome of which is 5–10 minutes of downtime. While not ideal, it is something we are tolerant of, at least today. Furthermore, we’ve done a lot of work lately to improve the reliability and durability of the upload processing pipeline, which has helped to insulate it from failovers. Uploads received during the failover are buffered, and are able to resume processing once things are available again.
Second, ease of schema migrations turned out not to be that useful for us. In talking with product teams internally, there were few to no schema changes that they wanted to make. We did not hear from any teams about wishing they had alternate ways to index efforts, nor desires to store additional data about efforts in new columns. Most of their data needs were for aggregations or other derivative data sets built off of segment efforts, not changes to the segment efforts schema itself.
Instead of adding indexing or columns directly to the segment_efforts table, most product teams have built their own segment effort-based derivative data sets (i.e. a segment’s local legend, an athlete’s biggest climb, a segment’s leaderboards, etc) stored and indexed in their own storage systems, independent of the segment_efforts table. This is a common data strategy within Strava, primarily done to maintain service isolation between data services, but also allowing data stores to focus on indexing data to meet their needs, vs having to add columns and indexing capabilities to a single centralized segment_efforts table.
The Boring Option
Understanding these tradeoffs, the most logical option for our data store problem ended up actually being the most boring one: we chose MySQL. And not like a more savvy sharded MySQL solution (i.e., with Vitess) — we just ended up creating a second table, literally called segment_efforts_2, with the correct 64 bit activity_id column.
The reality of the situation was that we had, at best, 5 months left on our timeline. If we chose a new data store we’d have to fully productionalize its deployment (alerting, observability, etc) and test it enough to feel confident in it within those 5 months — certainly doable, but risky. With MySQL, we knew exactly what we were getting into. We have many years of operational experience with it, know its characteristics and know it can perform the job well. The main disadvantages of MySQL — poor failover story and no online schema migrations — actually ended up being of lesser concern for us.
That said, eventually we’re going to need a new segment efforts storage system. Continuing with MySQL does not really advance the storage capabilities of segment efforts. If we were to have 10x the rate of segment effort writes, I am not sure a single MySQL instance (no matter how large it is) would be sufficient. That world is a ways off, however, and we have a more urgent and very real deadline to make with the activity_id rollover.
Furthermore, if at a later date we decide to use a new data store, that will not be made any harder by having two MySQL tables as opposed to one. And in fact, because segment effort access is well abstracted behind a service, changing storage systems is much easier than when we first started this project.
It’s certainly very anticlimactic to have done all this work of creating a service and all the work in Rails to port the logic, just to effectively keep using MySQL. However, if we didn’t go through with the service, clients would have had to add the bonding logic in every place where they access segment_efforts. That is a lot of places for duplicate implementations of the same logic, and for that logic to get out of sync, to have bugs, etc. And since we’re already touching all these places in code to add that logic, we might as well just have a service to encapsulate that logic and change those places to interact with the service instead.
As an added benefit, since our backend services ecosystem and tooling is more powerful than what exists in our Rails application, by moving things to a service we made most segment effort workflows faster, cheaper, and more accurate than before. And we’ve enabled us to easily and safely make future changes to segment efforts logic or storage without the clients having to do anything else.
A lot of work on teams like our Foundation team is to take existing projects or systems others may have created in the past and fix emerging problems with them. Sometimes those problems are simple, but sometimes they are hard. This was one of the hard times.
It’s very likely when the segment_efforts schema was created in 2009, those engineers never thought about or knew about this problem. Or maybe they did, but figured “by the time we have over 4 billion activities we’ll have a new segment effort storage system.” Or maybe they just figured we’d be able to run one of the online schema change tools that had always worked for every single schema change Strava had ever made up until that point. It’s hard to say.
Building software is all about making good decisions and tradeoffs given the data you have available. Sometimes you know about and understand those tradeoffs at the point when you make decisions, and sometimes you don’t. But one thing is certain: every piece of code you’ve written, and every technology choice you’ve made will eventually have to be dealt with by someone.
You may not be the one dealing with those problems, but someone will. As best you can, do as much work ahead of time to understand these tradeoffs and flush out ones you don’t know. Especially for consequential things which are hard to change, like database schemas.