Envoy Mobile v0.2 deep dive

In November we released Envoy Mobile v0.2. In the accompanying blog post we detailed the features that the library supported and announced that we had replaced the networking libraries in Lyft’s alpha rider app with Envoy Mobile. In this blog post, I want to expand on the technical aspects of the v0.2 release and take a deep dive into Envoy Mobile’s current architecture.

Why Bring Envoy… to Mobile?

Many companies, Lyft included, have realized the benefits of deploying Envoy across their backend infrastructure: consistent behavior, dynamic configuration, and best-in-class observability, among others. Envoy handles every network hop in the data center. From the network edge, into the service mesh, even when egressing the VPC, Envoy takes care of all the network traffic.

Envoy has done a great deal to help us achieve high reliability server-side: having a universal network primitive that is performant, reliable, configurable, and extensible has made the network largely transparent to server engineers. And importantly, Envoy’s best-in-class observability has made it increasingly easy to debug issues when they arise.

The realization that brought us to Envoy Mobile was that solving these problems in the data center was not enough, and over the last 6 months we have worked toward a v0.2 release where we could start to reap the benefits of a truly universal network primitive.

What is in v0.2?

From our v0.2 announcement:

Envoy Mobile v0.2 is a fundamental shift in how mobile clients use Envoy. Envoy Mobile now provides native Swift/Kotlin APIs that call through to Envoy directly (rather than using Envoy as a proxy), which apps use to create and interact with network streams.

This release includes a variety of new functionality:

1. HTTP request and streaming support.

2. gRPC streaming support through a built-in codec.

3. Automatic retries using Envoy’s retry policies.

4. Programmatic, typed configuration for launching the Envoy network library.

Library Internals

Let’s see how the library was designed from the ground-up to achieve the functionality described above. This section will discuss:

  1. The build system and how it relates to the overarching organization of the library.
  2. The lifetime of a request through the Envoy Mobile Engine and how we leverage existing Envoy constructs to ensure networking performance, reliability, and DRY code.

Hopefully this deep dive is not only informative and interesting, but also entices you, the reader, to jump on board and start exploring the library!

10,000 Foot View

There are two main organizational dimensions in the Envoy Mobile project: API layers, and threading contexts. This section will describe both of them.

API Layers

At a high level, Envoy Mobile is a project in three parts:

  1. Platform code: all the code that is written for either Android/JVM (in Java and Kotlin) or iOS (in Objective-C and Swift). A fundamental goal of the library is cross-platform consistency. Therefore, the code in this layer is largely identical between both platforms.
  2. Bridging code: common C types that allow us to bridge from the platform specific constructs down to the C++ codebase. Note that this layer can also one day enable us to write bindings for other languages like Python, Rust, etc. (hint hint wink wink).
  3. Native code: written in C++ and interfaces directly with the rest of the Envoy codebase.

the three layers of Envoy Mobile

This layered architecture is closely tied with how we build the library. Envoy itself is built with Bazel, which already has good tooling for multi-platform/multi-language projects, making it an excellent fit for building Envoy Mobile. Getting this set up was not trivial, and much of the initial effort of the project was put towards successfully compiling for each of the required languages and platforms. After completing the initial setup, things have worked reasonably well, most of the time.

The diagram below shows how the Bazel targets map over from the tiered architecture discussed in the previous section:

high-level bazel targets are closely tied with how the library is architected

On the left, in red, we have platform-specific targets for iOS and Android. In the middle, in blue, are the targets for the bridging code. And lastly on the right, in green, is the native C++ code (both the core of the library, and Envoy itself, as a dependency). This gives us an overview of one fundamental dimension of the project: how we organized the library and thought about its API surface.

Threading Contexts

The other foundational concern was how to take something that was meant to run as a multi-threaded process and run it in a single-threaded context within a sandboxed mobile application. In other words, running Envoy as an Engine, rather than as a process. This dimension gave us the threading contexts across which the library had to be divided:

  1. The application threads that interact with the Engine and issue network calls.
  2. The main Envoy thread which runs the Engine.
  3. The callback threads where the Engine surfaces callbacks when responses are issued (whether from over the network or generated by the Engine).

If we layer these threading concerns on top of the API layers of the library, we get a handy-dandy matrix that allows us to explain the lifetime of a request and the library components that make it happen:

conceptual matrix of the Envoy Mobile library

Library Lifecycle

Using the mental model described above, we can explore the lifecycle of network I/O in Envoy Mobile.

Envoy Mobile exposes a top level Client interface (iOS/Android) that allows the application to start the Envoy Engine and provide it with a configuration. Although this Client is held in the application thread, the Engine is run in a separate native C++ thread — the Envoy main thread.

It is worth noting that Envoy itself contains mostly single-threaded code. When Envoy is running as a server, its main thread is responsible primarily for lifecycle tasks and bookkeeping. When a listener receives network events on a socket, its tasks are handled in worker threads. If a worker needs to interact with the main thread, it uses an event dispatcher interface largely built on top of a libevent implementation (more information on Envoy’s threading architecture can be found in this blog post). Lastly, if a worker thread needs to issue network I/O off-band, it does so via an AsyncClient (code here).

In Envoy Mobile, we leveraged these existing Envoy constructs (cross-thread dispatch and the AsyncClient) by hooking the Engine into Envoy’s main-thread event dispatcher and running AsyncClients in the context of the main thread. The result of this approach is that all work is dispatched from the application threads to the Engine’s main thread and out to the network via existing mechanisms in Envoy.
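To make the dispatch pattern concrete, here is a minimal sketch of cross-thread dispatch. The `Dispatcher` class below is hypothetical, not Envoy’s actual `Event::Dispatcher` API; it only illustrates the shape of the mechanism: application threads post closures onto a queue that the engine’s main thread drains.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical sketch of cross-thread dispatch: application threads never
// touch engine state directly; they post closures that the engine's main
// thread executes in order.
class Dispatcher {
 public:
  // Called from any application thread.
  void post(std::function<void()> fn) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push_back(std::move(fn));
    }
    cv_.notify_one();
  }

  // The engine's main-thread event loop: drain closures until a null
  // closure signals shutdown.
  void run() {
    while (true) {
      std::function<void()> fn;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return !queue_.empty(); });
        fn = std::move(queue_.front());
        queue_.pop_front();
      }
      if (!fn) return;  // shutdown signal
      fn();
    }
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
};
```

In the real library the loop is libevent-based rather than a hand-rolled queue, but the contract is the same: all engine state is owned by one thread, and everything else communicates with it by posting work.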

This means that Envoy Mobile runs through the battle-tested and hardened code paths of Envoy itself.

the Envoy Mobile engine leverages Envoy constructs.

Like Envoy, Envoy Mobile treats streaming as a first-class construct, and it is designed to work seamlessly with persistent streams while still providing convenient traditional APIs. This forward-thinking approach ensures flexibility within the library and a path to unlocking new streaming protocols like QUIC in the future. It is evident both in our use of Envoy’s AsyncClient and in the streaming-based API exposed in Envoy Mobile’s bridge layer here.

When the application creates and dispatches a request via Platform-level APIs, Envoy Mobile transforms that request into a new HttpStream that lives in the Bridge layer. Before the Engine can pass handling of those API calls down to Envoy’s main thread (and thus to C++ memory constructs), we have to safely hand off the ownership and lifecycle of memory from the application to Envoy.

This was another case where Envoy Mobile was able to leverage pre-existing functionality in Envoy. Envoy has an elegant buffer abstraction. Buffers are composed of buffer fragments that have callbacks that indicate when a buffer has been drained and the underlying memory is no longer needed. Perfect! This allowed us to directly tie platform memory management schemes to the lifecycle of memory usage in core Envoy code without any special bookkeeping — all in a platform-agnostic fashion.

To illustrate, consider passing a Java ByteBuffer through Envoy Mobile: in the JNI bridging code, the ByteBuffer is converted to an envoy_data.

Importantly, note that there is no copying being done; envoy_data is wrapping the memory of the ByteBuffer directly. Additionally, it attaches a static release function (which deletes the global ref captured here), and the Java object as the type-erased context.

The envoy_data is then given to Envoy as an unowned Buffer fragment which, as described above, gets released when Envoy is done with the wrapping Buffer. The effect that this creates is that Envoy Mobile avoids unnecessary buffer copies even when passing memory blocks across threading and API boundaries — all without complicated bookkeeping.
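Putting these pieces together, here is a minimal sketch of the idea. The struct and function names below are hypothetical stand-ins, not the actual bridge-layer definitions: a C-style data wrapper points at external memory without copying it and carries a release callback plus a type-erased context (in the real library, for example, a JNI global reference to the ByteBuffer).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the bridge layer's envoy_data: a pointer + length,
// plus a release callback and a type-erased context identifying the
// platform object that owns the memory.
struct envoy_data {
  const uint8_t* bytes;
  size_t length;
  void (*release)(void* context);
  void* context;
};

// Simulated platform-side bookkeeping. In the real library the release
// callback would delete a JNI global reference; here we just record that
// the platform was told its memory is no longer needed.
static bool platform_buffer_released = false;
static void release_platform_buffer(void* /*context*/) {
  platform_buffer_released = true;
}

// Wrap existing memory without copying it.
envoy_data wrap(const uint8_t* bytes, size_t length, void* context) {
  return envoy_data{bytes, length, &release_platform_buffer, context};
}

// Envoy-side consumer: once the wrapping buffer fragment is drained, the
// release callback fires and the platform may reclaim the memory.
void drain(envoy_data data) {
  // ... Envoy would write data.bytes to the socket here ...
  data.release(data.context);
}
```

The key property is visible in the sketch: the bytes pointer is handed straight through, and the only coordination between the platform and Envoy is the release callback tied to the buffer fragment’s lifecycle.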

Once the memory barrier is bridged and Platform-specific constructs are transformed into Envoy constructs (e.g., from envoy_headers in the Bridge layer to Http::HeaderMap in Envoy’s codebase), Envoy Mobile uses the event dispatcher to inject HTTP events into the AsyncClient. This integration is an instance where some work was done in upstream Envoy to fully enable the solution. We expect this type of collaboration to continue to flourish out of the partnership between Envoy and Envoy Mobile.

This gives us a fully functional solution in the outbound direction, where the application issues network I/O via Envoy Mobile.

Once the request is sent to the network and a response is received, Envoy Mobile needs to traverse in the inbound direction to surface the response back to the Platform layer. Envoy’s AsyncClient fires an array of callbacks that were easily leveraged in Envoy Mobile’s codebase — another instance where Envoy Mobile’s deliberate design allowed us to utilize existing mechanisms in the Envoy codebase.

To prevent arbitrary user-provided Platform code from running and potentially blocking Envoy’s main thread, Envoy Mobile uses Platform-specific dispatching mechanisms (Grand Central Dispatch on iOS and Executors on Android) to fire callbacks on application threads using Platform-specific lambdas. This presented another interesting challenge: C function pointers (remember, the Bridge layer is written in C) have no facility to capture state as true lambdas do. To facilitate mapping to Platform-specific callback abstractions, Envoy Mobile captures Platform-specific context in a type-erased manner. This allows Envoy to hold that context without any knowledge of its contents. Upon dispatch, that context is used to reconstitute the high-level Platform lambdas supported by the API, thus allowing Platform-rich handling of HTTP callbacks.

For instance, when iOS builds its envoy_http_callbacks, it assigns static callback functions that are capable of reconstituting the type-erased context and calling the Platform-given callbacks on a Platform dispatch queue. Each static function reconstitutes the void* context into a defined type known to the Objective-C code, and is then able to get the Platform-specific EnvoyHTTPCallbacks object, which holds the user-provided dispatch queue and the user-provided block to execute.
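The pattern can be sketched in a few lines. The types below (`envoy_http_callbacks` with only an `on_headers` slot, `PlatformCallbacks`) are simplified, hypothetical stand-ins for the real bridge and platform types; they show how a stateless C function pointer plus a void* context can reconstitute a stateful platform lambda.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical, simplified mirror of the C bridge: a plain function pointer
// plus a type-erased context. C function pointers cannot capture state, so
// the context carries everything the platform layer needs.
struct envoy_http_callbacks {
  void (*on_headers)(const char* status, void* context);
  void* context;
};

// Platform-side state the context points to: in the real library this would
// also hold the dispatch queue / Executor to fire the callback on.
struct PlatformCallbacks {
  std::function<void(std::string)> on_headers_block;  // user-provided lambda
};

// Static bridge function: reconstitutes the typed object from void* and
// invokes the rich platform lambda (in the real library, asynchronously on
// the user-provided queue rather than inline).
static void on_headers_bridge(const char* status, void* context) {
  auto* platform = static_cast<PlatformCallbacks*>(context);
  platform->on_headers_block(std::string(status));
}
```

Envoy only ever sees the function pointer and an opaque pointer; the platform layer alone knows how to turn that pointer back into something it can call.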

Lastly, we made a deliberate design decision in the library to avoid synchronization between the outbound and inbound directions (with the exception of stream cancellation state, which is checked atomically before callbacks dispatch to the application). All the management of the request and response state is delegated to the Engine, and Envoy Mobile enforces contracts in its internal APIs to keep these paths independent. This results in a dramatically simpler implementation in the library itself as well as in external code consuming the library.
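The one synchronized exception, cancellation, can be sketched as follows (hypothetical names, not the actual Envoy Mobile types): the only state shared between the outbound and inbound paths is an atomic flag, checked before any callback is dispatched to the application.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Hypothetical sketch: the single piece of state shared between the
// outbound (request) and inbound (response) paths is an atomic
// cancellation flag.
struct StreamState {
  std::atomic<bool> canceled{false};
};

// Inbound path: dispatch the platform callback only if the application has
// not already canceled the stream. Returns whether the callback ran.
bool dispatch_if_active(StreamState& state, const std::function<void()>& cb) {
  if (state.canceled.load(std::memory_order_acquire)) return false;
  cb();
  return true;
}

// Outbound path: cancellation just flips the flag; no locks, no shared
// request/response bookkeeping.
void cancel(StreamState& state) {
  state.canceled.store(true, std::memory_order_release);
}
```

Everything else about request and response state lives inside the Engine, which is what keeps the two directions free of mutual locking.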

Just like that, we have made a full trip around the lifecycle of HTTP traffic in the Envoy Mobile library.

The days ahead!

As you can see in this blog post, Envoy Mobile is built using deep integration with the Envoy codebase, and leverages a lot of the technologies built into Envoy. Although Envoy Mobile already takes advantage of Envoy’s AsyncClient, this is just the beginning.

One of Envoy’s most powerful features is the concept of L7 filters. These filters are written to operate on HTTP-level messages without knowledge of the underlying protocol version. This feature has allowed complex deployments server-side, enabling functionality like health checking, rate limiting, or even fronting storage primitives like DynamoDB.

Since this project’s onset, we have been working toward solutions that open the doors for doing similarly complex operations on mobile clients agnostic of the architecture — solving problems like deferred requests, OAuth, and compression in one place across both mobile platforms. We believe Envoy’s L7 filter system is the best place to accomplish this.
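To give a feel for why filters are such a good fit, here is a hedged sketch of the shape of an L7 filter. The interface below is hypothetical, not Envoy’s actual StreamFilter API: the point is that a filter sees HTTP-level events with no knowledge of the underlying protocol version, so a cross-cutting concern can be implemented once for both mobile platforms.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified L7 filter interface: filters operate on
// HTTP-level messages, agnostic of the wire protocol underneath.
using HeaderMap = std::map<std::string, std::string>;
enum class FilterStatus { Continue, StopIteration };

class HttpFilter {
 public:
  virtual ~HttpFilter() = default;
  virtual FilterStatus onRequestHeaders(HeaderMap& headers) = 0;
};

// Example filter: attach an auth header in one place for every request,
// the kind of concern (OAuth, compression, deferred requests) the post
// describes solving once across platforms.
class AuthFilter : public HttpFilter {
 public:
  FilterStatus onRequestHeaders(HeaderMap& headers) override {
    headers["authorization"] = "Bearer <token>";
    return FilterStatus::Continue;
  }
};
```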

One of the issues we opened in our fully public v0.3 roadmap was to work on Envoy Mobile’s filter chain support. In order to give Envoy Mobile access to L7 filters, we needed to replace the AsyncClient-based direct API with one that harnesses Envoy’s HTTP connection manager. This is exactly what we did in late December by exposing an Envoy API listener and leveraging it in Envoy Mobile. Excitingly, with those two PRs we were able to fully test Envoy’s L7 filters in Envoy Mobile, starting with the dynamic proxy filter, which gives our mobile APIs even more flexibility since Envoy Mobile does not need to resolve DNS addresses a priori. This is just a taste of the power that we will unlock with the next version of Envoy Mobile!

Other Resources

This blog post dove deep into the internals of Envoy Mobile v0.2. It is our goal to develop Envoy Mobile out in the open and to distribute a rich set of materials which other parties in the industry can start leveraging, building upon, and collaborating on.

Here are a few other resources that complement this post:

  1. If you prefer video: a lot of the content in this blog post was shared in two talks Mike Schore and I gave at EnvoyCon and KubeCon NA 2019. If watching us explain the material is easier than reading about it, feel free to head on over!
  2. This blog post complements Michael’s blog post about Lyft’s mobile networking journey very well. In his blog post he takes us through a brief history of API design at Lyft, and how we have evolved mobile APIs over the last few years. In other words, how we focused on the “what” that we transmit over our APIs, and how that culminated in Envoy Mobile: an opportunity to focus on “how” we transmit over our APIs.
  3. The Envoy Mobile repo, where we host our open source roadmap. We do this to encourage you to actively participate in this project!

Stay tuned, we have a lot more content coming over the next few months, from how we are able to observe the network using Envoy Mobile metrics, to a deeper dive into the filter chain integration teased above!

If you’re interested in joining the team at Lyft, we’re hiring! Feel free to check out our careers page for more info.


Envoy Mobile v0.2 deep dive was originally published in Lyft Engineering on Medium.