Distributed tracing remains one of the most important features of any tracing system. Nearly a year ago, we announced Elastic APM distributed tracing. Let’s take a look at how this useful feature works behind the scenes.
Over the past few years, many applications have adopted a microservice architecture. Each service in a microservice architecture can have its own instrumentation to provide observability into that service. However, since all of these services work together to fulfill a request, it is often desirable to look at the trace as a whole. This is known as distributed tracing. But how does a tracing system provide a unified trace that consists of multiple services running on different machines? The answer is context propagation.
The need for context propagation
To achieve distributed tracing, each individual service needs to be able to communicate with upstream services and provide information that uniquely identifies the current trace. This is known as context propagation. APM providers have, for the most part, created their own context propagation mechanisms. However, there are a number of benefits to adopting one unified format for context propagation:
- Multiple APM vendors can be used to monitor the same microservice architecture without losing observability.
- Network software (proxies, load balancers, etc.) can automatically correlate logs with the current trace and make sure the context is propagated properly.
- Libraries and frameworks can recognize the context propagation mechanism and facilitate capturing different parts of a trace.
- Third-party API providers can provide the context information to their users for further investigation.
- Web browsers can expose the trace context to be used by frontend applications.
These benefits, and many others, motivated the W3C TraceContext working group to define a unified standard for context propagation, which has been adopted by many APM vendors and libraries.
The TraceContext specification defines a format for the propagation of the trace context. The standard defines two main HTTP headers, namely traceparent and tracestate. The tracestate header can be used to propagate vendor-specific information. The traceparent header is enough to identify a trace within a system. Its value consists of the following parts, separated by dashes (-):
- Version
- Trace ID
- Parent ID
- Trace Flags
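To make the layout concrete, here is a minimal sketch of a traceparent parser in Python. It assumes the version 00 layout from the W3C specification (a 2-character version, a 32-character trace ID, a 16-character parent ID, and 2-character trace flags, all lowercase hex). The names TraceParent and parse_traceparent are hypothetical and are not part of any Elastic APM agent.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout (W3C Trace Context, version 00):
# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    """Hypothetical container for the four dash-separated fields."""
    version: str
    trace_id: str
    parent_id: str
    trace_flags: int

    @property
    def sampled(self) -> bool:
        # The lowest bit of the trace flags is the "sampled" flag.
        return bool(self.trace_flags & 0x01)


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Return the parsed fields, or None if the value is malformed."""
    match = _TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero IDs are treated as invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceParent(version, trace_id, parent_id, int(flags, 16))


# Sample value in the style used by the W3C specification.
sample = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
parsed = parse_traceparent(sample)
```

This sketch is deliberately strict: it only accepts the version 00 shape and does not attempt to handle extra fields that future versions of the header might append.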
By adding the traceparent header to HTTP requests, different components of an application can communicate trace context and report related spans as part of that trace. Once the different components of a trace are reported to central storage, the tracing system can use the Trace ID and Parent ID not only to construct the whole trace but also to show the parent-child relationships between its components. This is usually visualized as a waterfall such as:
The above screenshot is an example waterfall from the Elastic APM trace view. The waterfall contains multiple services (shown in different colors) that contributed to the final response to the original request. Each service (instrumented with an APM agent) separately reports its own internal monitoring data as a collection of spans (individual measurements shown as horizontal bars). Whenever a service makes an external request, its agent also attaches the ID of the span that represents that request, in the form of the TraceContext header. The upstream service that receives the request (also instrumented with an APM agent) recognizes this header and uses the included trace ID and parent ID to report its own spans as part of the same trace. This process continues until the trace is complete.
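The propagation flow described above can be sketched in a few lines of Python. This is a simplified illustration, not how the Elastic APM agents are implemented: the span dictionaries, the service names, and the helpers new_span_id, outgoing_traceparent, and build_waterfall are all hypothetical, and the "central storage" is just a list.

```python
import secrets
from collections import defaultdict
from typing import List, Optional


def new_span_id() -> str:
    # 8 random bytes rendered as 16 hex characters.
    return secrets.token_hex(8)


def outgoing_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    # The calling span's ID becomes the parent ID seen by the receiver.
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def build_waterfall(spans: List[dict]) -> List[str]:
    """Nest spans under their parents and return indented text lines."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)
    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in sorted(children.get(parent_id, []), key=lambda s: s["start"]):
            lines.append("  " * depth + f'{span["service"]}: {span["name"]}')
            walk(span["id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


# Service A (hypothetical "frontend") starts the trace and calls service B.
trace_id = secrets.token_hex(16)
root = {"id": new_span_id(), "parent_id": None,
        "service": "frontend", "name": "GET /checkout", "start": 0}
call = {"id": new_span_id(), "parent_id": root["id"],
        "service": "frontend", "name": "GET /payments", "start": 5}
header = outgoing_traceparent(trace_id, call["id"])

# Service B (hypothetical "payments") reads the header from the request.
_, incoming_trace_id, incoming_parent_id, _ = header.split("-")
handler = {"id": new_span_id(), "parent_id": incoming_parent_id,
           "service": "payments", "name": "POST /charge", "start": 7}

# Both services report independently; order of arrival does not matter.
lines = build_waterfall([handler, root, call])
print("\n".join(lines))
```

Because each span carries its parent ID, the tree can be rebuilt no matter which service reports first, which is what makes a waterfall view across services possible.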
Elastic APM and TraceContext
Realizing the benefits of W3C TraceContext early on, we on the Elastic APM team were among the first to implement the standard. At the time, the specification was in its early stages, so we decided to implement it with a different header name (elastic-apm-traceparent) than the one suggested. This was done to avoid breaking APM deployments as the standard evolved during its initial phase.
The W3C TraceContext specification recently entered W3C recommendation status, which means the specification is endorsed by the W3C Advisory Committee. Therefore, we are also taking the final step to be fully compatible with the specification and use the suggested header name. Furthermore, all of our language-specific agents have fully implemented the W3C TraceContext specification. See our Distributed tracing guide for more information.
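For services that may receive requests carrying either header name during such a transition, a lookup could be sketched as follows. This is an illustration only: the function read_traceparent and the choice to prefer the standard name over the older one are assumptions, not a description of how the Elastic APM agents behave.

```python
from typing import Mapping, Optional

W3C_HEADER = "traceparent"
LEGACY_HEADER = "elastic-apm-traceparent"


def read_traceparent(headers: Mapping[str, str]) -> Optional[str]:
    """Return the trace context value from either header name, if present."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    # Assumption: prefer the standard header when both are present.
    return lowered.get(W3C_HEADER) or lowered.get(LEGACY_HEADER)


value = read_traceparent(
    {"Elastic-Apm-Traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"}
)
```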
Check it out
You can see how your services interact and where they spend their time, discovering bottlenecks, errors, and exceptions, simply by adding a few lines of code to your applications. Get started with Elastic APM by spinning up a free cloud trial or downloading APM server and the free and open default distribution today.