In this post, we’ll define distributed tracing, explain our motivation for adoption, and highlight a few scenarios where it can be helpful for operating software.
Once upon a time, some 12 years ago, Strava was a single Rails app backed by a couple of SQL tables. Fast forward to today, and our infrastructure consists of dozens of microservices operated by a handful of product teams. This shift has enabled faster development: teams can design, build, test, and operate systems independently. Each year, our development process grows less bottlenecked by our Rails app, and we continue to work towards the long-term goal of retiring this monolith entirely in favor of microservices.
This choice does not come without a cost, as one tweet humorously put it:
“We replaced our monolith with micro services so that every outage could be more like a murder mystery.”
Our server code was once contained within a single process. Debugging failures in this environment is straightforward and can be handled with metrics, logging, and profiling. With microservices, a single inbound client request could easily fan out to multiple services. Triaging failures in this environment requires a tool to correlate and analyze requests that transcend process boundaries.
At the time of writing, Strava runs over 60 microservices — many with their own databases and caches. This number will continue to grow, and we don’t expect our engineering teams to know the role of every service in our architecture, let alone remember all possible request pathways through these services. With distributed tracing, our engineers have a powerful tool in their arsenal to understand how the services they own fit into the bigger picture of Strava’s infrastructure.
Overview of Distributed Tracing
Distributed tracing is a mechanism to annotate requests flowing through a distributed system with metadata about all of the components involved in handling those requests. The key terms in tracing systems are span, tag, and trace.
A span describes an operation. This may be a server handling a request, a cache storing data, or a computationally intensive function call. Tags enrich a span with metadata about the operation. Common tags include the operation name, the operation duration, and information about any errors or exceptions encountered during the operation. A span also includes a tag defining its relation to other spans. A span can have a reference to a parent span, and any given span may be referred to as a parent by multiple other spans. Those other spans are called child spans; they represent operations caused by the parent span, and often operations that must complete before the parent span can complete. All spans bound by these parent/child relations share a trace ID. A trace, then, is the collection of spans with the same trace ID, and it describes the end-to-end path of a request through a distributed system.
We’ll motivate further with an example. Suppose there is a system with four components: an API gateway, a supporting backend service, and a database with a cache. All requests to this system hit the API gateway, which dispatches to the backend service. The backend service stores all durable data in the database, which is supported by a write-through cache to boost performance. Each node performs an operation that can be described as a span. Moreover, some of those operations are responsible for others. These spans form the distributed traces for client requests through the system.
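To make the span and trace vocabulary concrete, here is a minimal sketch of the spans one request might produce in this four-component system. The `Span` class, its field names, the IDs, and the operation names are all hypothetical, and the sketch assumes a read where the backend service misses the cache and then queries the database.

```java
import java.util.ArrayList;
import java.util.List;

/** One operation in a trace. Field names are illustrative, not from a real tracing library. */
final class Span {
    final String traceId;   // shared by every span in the same trace
    final String spanId;    // unique within the trace
    final String parentId;  // null for the root span
    final String operation; // would normally be carried as a tag

    Span(String traceId, String spanId, String parentId, String operation) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentId = parentId;
        this.operation = operation;
    }
}

final class FourComponentTrace {
    /** Spans that one hypothetical read request might produce. */
    static List<Span> spans() {
        List<Span> spans = new ArrayList<>();
        spans.add(new Span("trace-1", "a", null, "api-gateway: handle request"));
        spans.add(new Span("trace-1", "b", "a", "backend-service: load record"));
        spans.add(new Span("trace-1", "c", "b", "cache: lookup (miss)"));
        spans.add(new Span("trace-1", "d", "b", "database: query"));
        return spans;
    }

    /** Renders the parent/child relations as an indented tree, one span per line. */
    static String render(List<Span> spans) {
        StringBuilder out = new StringBuilder();
        for (Span s : spans) {
            if (s.parentId == null) {
                renderFrom(s, spans, 0, out);
            }
        }
        return out.toString();
    }

    private static void renderFrom(Span node, List<Span> all, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.operation).append('\n');
        for (Span s : all) {
            if (node.spanId.equals(s.parentId)) {
                renderFrom(s, all, depth + 1, out);
            }
        }
    }
}
```

Calling `FourComponentTrace.render(FourComponentTrace.spans())` yields the gateway span at the root, the backend span nested beneath it, and the cache and database spans as siblings under the backend span. All four share the trace ID `trace-1`, which is what ties them into a single trace.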
In this contrived four-node example, distributed tracing may be overkill. However, as a product evolves and becomes more feature-rich, its system architecture necessarily grows more distributed and complex, and tracing becomes an invaluable triage tool.
Building a Distributed Tracing System
Distributed tracing solutions consist of three components: instrumentation, implementation, and collection.
Instrumentation is the application and library code written to describe tracing behavior. This comes in the form of language-specific APIs that act as facades for backing implementations. Let’s take a look at a use of the opentracing-java API to instrument a computationally expensive function. For the sake of simplicity, there’s no error handling over the operation:
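The sketch below approximates that pattern. To stay runnable without any dependencies, it uses two small hypothetical interfaces, `SketchTracer` and `SketchSpan`, whose methods loosely mirror OpenTracing naming (`buildSpan`, `setTag`, `finish`). They are not the real opentracing-java types, and `InMemoryTracer` is a toy backend invented for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-ins for a tracing facade. NOT the real OpenTracing types.
interface SketchSpan {
    SketchSpan setTag(String key, String value);
    void finish();
}

interface SketchTracer {
    SketchSpan buildSpan(String operationName);
}

/** Instrumented application code: it depends only on the facade, never on a backend. */
final class Fibonacci {
    static long traced(SketchTracer tracer, int n) {
        SketchSpan span = tracer.buildSpan("fibonacci").setTag("n", Integer.toString(n));
        try {
            return compute(n);
        } finally {
            span.finish(); // always close the span so its duration gets recorded
        }
    }

    // Deliberately naive recursion, standing in for a computationally expensive call.
    private static long compute(int n) {
        return n < 2 ? n : compute(n - 1) + compute(n - 2);
    }
}

/** A toy in-memory backend, standing in for a real tracing implementation. */
final class InMemoryTracer implements SketchTracer {
    final List<Map<String, String>> finished = new ArrayList<>();

    @Override
    public SketchSpan buildSpan(String operationName) {
        final long startNanos = System.nanoTime();
        final Map<String, String> tags = new LinkedHashMap<>();
        tags.put("operation", operationName);
        return new SketchSpan() {
            @Override
            public SketchSpan setTag(String key, String value) {
                tags.put(key, value);
                return this;
            }

            @Override
            public void finish() {
                // The backend, not the application, owns timing and buffering.
                tags.put("duration_ns", Long.toString(System.nanoTime() - startNanos));
                finished.add(tags);
            }
        };
    }
}
```

Because `Fibonacci.traced` only sees the facade, swapping `InMemoryTracer` for a production backend would not require touching the instrumented code.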
For a fleshed-out example using OpenTracing, you can check out this Scala utility object we use in production. Most of our instrumentation code is baked into our internal Scala service libraries. This way, our engineering teams get trace coverage “for free” as a consequence of building new services on these libraries. We use the OpenTracing Java and Ruby instrumentation libraries throughout our stack. Other common tracing libraries for instrumenting Scala code include OpenCensus, Zipkin, and OpenTelemetry.
A tracing implementation backend is responsible for providing the behavior that the instrumentation describes. Put another way, the instrumentation describes the tracing behavior in the language of your choice, while the implementation provides this behavior in the same language. Implementations handle things like starting and stopping timers for spans, managing span propagation across threads, buffering spans, and transporting spans to a collection layer. We use Lightstep’s OSS Java and Ruby implementations to power our trace instrumentation code.
A key value proposition for tracing is the ability to correlate operations that cross process boundaries. As such, trace implementations allow you to configure a “collector”. This collector is a separate process that collects and organizes spans from all source services into complete traces. In our infrastructure, all Scala and Ruby services send their spans to a single collector. With this collector running in our infrastructure, we can multiplex our trace data to various backends, such as a data warehouse for analytics, object storage for long-term bulk storage and data processing, or, most importantly, a third-party APM service for debugging and monitoring.
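The core job of a collector, assembling spans from many services into traces, can be sketched in a few lines. `ToyCollector` and `ReportedSpan` are hypothetical names; a real collector also deals with transport, buffering, and fan-out to backends, none of which is modeled here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** A span as reported by some service. Illustrative only. */
final class ReportedSpan {
    final String traceId;
    final String service;
    final String operation;

    ReportedSpan(String traceId, String service, String operation) {
        this.traceId = traceId;
        this.service = service;
        this.operation = operation;
    }
}

/** Groups spans arriving from different services by their shared trace ID. */
final class ToyCollector {
    private final Map<String, List<ReportedSpan>> byTrace = new LinkedHashMap<>();

    /** Accepts a span from any service, in any arrival order. */
    void receive(ReportedSpan span) {
        byTrace.computeIfAbsent(span.traceId, id -> new ArrayList<>()).add(span);
    }

    /** Returns a copy of the spans collected so far for one trace (empty if unknown). */
    List<ReportedSpan> trace(String traceId) {
        List<ReportedSpan> spans = byTrace.get(traceId);
        return spans == null ? new ArrayList<>() : new ArrayList<>(spans);
    }

    int traceCount() {
        return byTrace.size();
    }
}
```

Spans from two concurrent requests can arrive interleaved; keying on the trace ID is what lets the collector separate them again.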
Distributed Tracing Use Cases
Once all your services are instrumented, you’ve chosen your tracing implementation(s), and you’ve set up trace collection, the real fun begins! Tracing is useful in many contexts, but we’ve included only the three most compelling use cases we’ve learned from a year of running our tracing system in production.
Service Dependency Graph
Many APMs offer handy UIs that visualize the dependency graph of a distributed system instrumented with tracing. This is a helpful tool for new hires. At Strava, engineering teams own subsets of our services. With a visual graph of our web of services, new engineers can quickly discover which services their team depends on, and which services depend on their team’s.
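How a dependency graph falls out of trace data can be sketched as follows: every parent/child pair of spans that crosses a service boundary contributes one caller-to-callee edge. This is an assumption about how such a graph might be derived, not a description of any particular APM, and `ServiceSpan` and `DependencyGraph` are hypothetical names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** A span reduced to what the graph needs. Illustrative only. */
final class ServiceSpan {
    final String id;
    final String parentId; // null for a root span
    final String service;

    ServiceSpan(String id, String parentId, String service) {
        this.id = id;
        this.parentId = parentId;
        this.service = service;
    }
}

final class DependencyGraph {
    /** Returns caller -> callees, derived from parent/child spans that cross services. */
    static Map<String, Set<String>> edges(List<ServiceSpan> spans) {
        Map<String, String> serviceBySpanId = new HashMap<>();
        for (ServiceSpan s : spans) {
            serviceBySpanId.put(s.id, s.service);
        }
        Map<String, Set<String>> edges = new TreeMap<>();
        for (ServiceSpan s : spans) {
            String caller = serviceBySpanId.get(s.parentId); // null when there is no known parent
            if (caller != null && !caller.equals(s.service)) {
                edges.computeIfAbsent(caller, k -> new TreeSet<>()).add(s.service);
            }
        }
        return edges;
    }
}
```

Fed the spans of many traces, the same function accumulates the union of all observed call paths, which is roughly the picture a service graph UI presents.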
Debugging Performance Bottlenecks
A common visualization of a trace is the waterfall view. This representation highlights the ordering and duration of all operations throughout the trace. For traces with many nested spans, this visualization helps pinpoint which services in an overall trace are the biggest performance limiters.
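One rough heuristic for reading a waterfall programmatically is a span’s self time: its own duration minus the time spent in its direct children. The sketch below applies it to the four-component example with made-up timings. `TimedSpan` and `Waterfall` are hypothetical names, and the subtraction assumes child spans run one after another without overlapping.

```java
import java.util.List;

/** A span with start and end timestamps in milliseconds. Illustrative only. */
final class TimedSpan {
    final String id;
    final String parentId; // null for the root span
    final String name;
    final long startMs;
    final long endMs;

    TimedSpan(String id, String parentId, String name, long startMs, long endMs) {
        this.id = id;
        this.parentId = parentId;
        this.name = name;
        this.startMs = startMs;
        this.endMs = endMs;
    }

    long durationMs() {
        return endMs - startMs;
    }
}

final class Waterfall {
    /** Duration of the span minus the durations of its direct children. */
    static long selfTimeMs(TimedSpan span, List<TimedSpan> all) {
        long childTotal = 0;
        for (TimedSpan s : all) {
            if (span.id.equals(s.parentId)) {
                childTotal += s.durationMs();
            }
        }
        return span.durationMs() - childTotal;
    }

    /** Returns the span with the largest self time, or null for an empty trace. */
    static TimedSpan bottleneck(List<TimedSpan> all) {
        TimedSpan worst = null;
        for (TimedSpan s : all) {
            if (worst == null || selfTimeMs(s, all) > selfTimeMs(worst, all)) {
                worst = s;
            }
        }
        return worst;
    }
}
```

With a gateway span of 0 to 120 ms, a backend span of 5 to 115 ms, a cache span of 10 to 12 ms, and a database span of 15 to 105 ms, the gateway and backend look slow by total duration, but nearly all of that time is spent waiting on the database, which is the span `bottleneck` returns.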
Incident Response
When a service alert fires in a distributed system, two questions need to be answered: “What is the cause of this service failure?” and “What is the blast radius of this service failure?” Distributed tracing helps answer both. Many APMs offer functionality to monitor the latency and error characteristics of services in real time. When an incident is underway, on-call engineers can quickly see which portions of a service graph are in a degraded state.
Consider our previous four-node example. Suppose an alert fires for the API gateway, claiming the API is experiencing an elevated failure rate. Let’s suppose the root cause of these failures is actually a database failover. With distributed tracing, an on-call engineer could visually see heightened failures from the database propagating upstream via the backend service to the API. Without traces, it may take an engineer unfamiliar with the architecture of the system much more time to determine that the spike at the API gateway is actually the fault of the database two service hops downstream.
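The reasoning an engineer applies here can be approximated in code: within a failing trace, the spans most likely to be the origin are those that failed while none of their children did. This is a simplifying heuristic for illustration, not how any specific APM attributes failures, and `OutcomeSpan` and `RootCause` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** A span annotated with whether its operation failed. Illustrative only. */
final class OutcomeSpan {
    final String id;
    final String parentId; // null for the root span
    final String service;
    final boolean error;

    OutcomeSpan(String id, String parentId, String service, boolean error) {
        this.id = id;
        this.parentId = parentId;
        this.service = service;
        this.error = error;
    }
}

final class RootCause {
    /** Services whose span failed although none of that span's children failed. */
    static List<String> likelyOrigins(List<OutcomeSpan> trace) {
        List<String> origins = new ArrayList<>();
        for (OutcomeSpan span : trace) {
            if (!span.error) {
                continue;
            }
            boolean hasFailingChild = false;
            for (OutcomeSpan other : trace) {
                if (span.id.equals(other.parentId) && other.error) {
                    hasFailingChild = true;
                }
            }
            if (!hasFailingChild) {
                origins.add(span.service); // the failure starts here rather than being inherited
            }
        }
        return origins;
    }
}
```

In the failover scenario, the gateway, backend, and database spans are all marked as errors while the cache span succeeds, so the only service reported is the database.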
We’ve been running distributed tracing in production for over a year, with coverage of the majority of services in our system, and it has become an invaluable asset for our teams operating software. The dependency graph of services gives onboarding engineers a fantastic introduction to our complex microservice architecture. Almost routinely, our tracing data exposes clear, actionable inefficiencies in our systems and helps us quickly catch performance regressions. Most importantly, distributed tracing has improved the experience and capabilities of our on-call engineers, and our incident response has certainly improved since we released tracing to our teams.