Applying gevent learnings to deliver value to users (Part 4 of 4)

Background

Earlier this year, Lyft launched a feature to allow users to compare bike/scooter modes against rideshare and transit modes. The goal of this project was to provide users with all the upfront information required in order to plan a bike or scooter journey, including the ETAs, routing and pricing information. After releasing the feature to our users, we noticed a glaring problem in the UX — bike and scooter modes would often flicker in and out, leading to a confusing end user experience.

Notice how bike and scooter ‘offers’ doesn’t appear at first, leading to a confusing user experience

This post outlines how we used our gevent learnings from Posts 1, 2, and 3 to root cause and fix the problem. This post is relevant to anyone leveraging latency-sensitive Python microservices that have mixed CPU and I/O workloads.

A 10,000 foot view of the problem

When a Lyft user enters an origin and destination in the Lyft app, mobile clients send a request to our backend services, asking them to generate a set of ‘offers’ for all the transportation modes that Lyft offers in the user’s region. An ‘offer’ comprises details of the mode pertinent to the user’s journey, notably the cost, ETA and route.

Our backend services compute ‘offers’ for each Lyft mode in parallel — we set a hard cap of 500ms on the time it takes to generate each offer to ensure that the end user experience is snappy. If an offer cannot be generated within these response time constraints, the offer is disregarded and the ones that have successfully been generated are returned to users. Clients poll our backend services every couple of seconds in order to keep this list of offers fresh for a user.

We found that bike and scooter offer generation was taking longer than 500ms in some cases, which meant that on some client polls, our backend services rendered bike and scooter offers, whereas on subsequent polls, bike and scooter offers failed to generate within the response time constraints and were disregarded.

To give you a sense of how bad this problem was, 0.5% of bike/scooter offers were failing to be generated due to timeouts, putting us at 2.5 nines of reliability — not great for one of the highest-value user experiences in the Lyft app.

Success rate (%) of bike and scooter offer generation

Our p95 latencies were doing pretty well and consistently below 450ms.

p95 latencies (ms) of bike and scooter offer generation

However, our p99 latencies were as high as 1 second in some Lyft regions (2x our p95s latency).

p99 latencies (ms) of bike and scooter offer generation

p999 latencies of bike and scooter offer generation were an astonishing 1.5 seconds (well over 3x our p95 latency).

p999 latencies (ms) of bike and scooter offer generation

The question after digging into this data was ‘why is there such a high variance between our p95, p99 and p999 latencies?’

To answer this question, let’s dig deeper into how a bike or scooter offer is generated. The backend service responsible for generating Lyft offers, (let’s call it offerservice) computes each offer in parallel. In order to compute the price of bike and scooter offers, offerservice, a Go microservice, makes an upstream request to pricingservice, a Python microservice.

Here’s an ASCII data flow diagram of how a bike and scooter offer is generated:

Mobile client (iOS/Android) -> offerservice (Go) -> pricingservice (Python) -> upstream microservices (Python + Go)

You may be wondering why we’ve explicitly called out which languages our backend microservices are written in. This is an important detail to keep in mind and it’ll become apparent why shortly.

Root causing

When digging into upstream service request latencies of pricingservice, we noticed that some calls were reported as taking ~110ms by the upstream service, but pricingservice’s instrumentation reported latencies as high as 5.3 seconds, a ~35x discrepancy.

~35x discrepancy between latencies reported by pricingservice instrumentation and instrumentation from an upstream service

Clearly, the delta in these two sets of latencies was unaccounted for and we needed to figure out where it was being spent. We made use of distributed tracing to debug where the delta was being spent (to learn more about how Lyft set up distributed tracing, you can check out this talk).

Using our distributed traces, we found that upstream requests from pricingservice were being queued and were spending a long time waiting around. We suspected that this might have something to do with the way that pricingservice’s web server was scheduling requests.

As mentioned earlier, pricingservice is a Python microservice which leverages Flask as its web application framework. Gunicorn is the web server used to front Flask and gunicorn uses gevent to schedule multiple requests at the same time.

Let’s recall that Python’s global interpreter lock effectively makes any CPU bound Python program single threaded. So, how do gunicorn and gevent serve thousands of incoming requests per second with this constraint? gevent spins up a greenlet (also known as green thread) for every request received by the microservice. These greenlets are scheduled to be run using an event loop using the cooperative multitasking framework we talked about in Part 1.

A typical greenlet for a request (let’s call this greenlet A) might do some work on the CPU, make an upstream service call (which involves blocking I/O) and finally perform some more work on the CPU. While performing blocking I/O, i.e the network request in this case, the greenlet will yield control to the event loop.

Gevent can then find another greenlet (let’s call this greenlet B) to schedule on its event loop. Once greenlet B is done, if greenlet A has finished its blocking I/O, the event loop schedules greenlet A to be run again. Greenlet A completes its CPU work and is taken off the event loop. The event loop now finds the next greenlet to run.

The key takeaway is that when a greenlet makes a network call, we wait not only for the call to complete, but also for the event loop to once again run our greenlet. If we run too many concurrent greenlets, CPU work for other requests can “block” our request from running.

Let’s illustrate with an example (imagine that each cell represents a 10ms time slice). Our networking infrastructure would measure the latency of Request 7’s network call as 70ms, however, the application would measure it as 180ms since we waited an additional 110ms to get scheduled on the event loop. CPU work for other requests “block” our request from running.

In the above example, running 3 concurrent requests would be nearly perfect:

Now, how does this relate to pricingservice and the high response times we were seeing earlier?

pricingservice houses hundreds of different endpoints that have varying levels of CPU and I/O workloads. Some endpoints are CPU intensive, whereas some (such as the endpoint that prices bike and scooter offers) are I/O heavy (since they call several upstream services).

The greenlets for CPU intensive endpoints in pricingservice were blocking greenlets for I/O heavy requests from being run. This was leading to the bike and scooter pricing endpoint’s greenlets being queued up and waiting longer to be scheduled by the event loop.

To solve this problem, we decided to run this endpoint on dedicated application service processes so that we’d get dedicated gunicorn processes to field our I/O heavy bike and scooter pricing requests. These requests and their greenlets would no longer need to contend with CPU intensive greenlets from other types of requests.

Using tooling built by our infrastructure teams, we were able to run our bike and scooter pricing endpoints on dedicated processes and we pointed downstream services at these dedicated hosts for all bike and scooter pricing requests. Once we rolled these changes out, we got exactly the results that we were hoping for!

Our success rate (%) leaped from ~99.5 to ~99.995.
A huge dip in the number of timeouts.

Timeouts for requests to our bike and scooter pricing endpoint stopped immediately, which led to the flickering we described earlier stopping. Reliability of bike and scooter pricing requests leaped from an average of 2.5 to 4.5 nines — a massive improvement!

Takeaways

The key takeaway for readers is that running a high throughput Python service (which leverages gevent) that runs mixed CPU and I/O workloads might lead to CPU intensive requests starving I/O intensive requests. We recommend employing distributed tracing to dig into request lifecycles and employing the recommendations from this gevent series to improve your Python service’s latencies and reliability.

Hopefully you’ve found this 4-part series on gevent to be useful and we hope that you’ll be able to leverage these learnings to deliver value to your users like we were able to.

This post wouldn’t have been possible without help from Roy Williams, David Quaid and Justin Phillips. Additionally, a huge thanks to Ilya Konstantinov, Daniel Metz, and Jatin Chopra for their advice and for reviewing this post.

Lyft is hiring! If you’re interested in improving peoples lives with the world’s best transportation, join us!


Applying gevent learnings to deliver value to users (Part 4 of 4) was originally published in Lyft Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source: Lyft