In this post-mortem blog post, we’ll explore the oft-neglected design of Rack and the Ruby web server startup sequence in depth, through the lens of a surprising critical issue encountered by a couple of our customers. This post is part one of two, in which we explore the issues and the initial fixes, and dive into Rack’s design.
A bit of context
The Sqreen agent installs inside of your application, and, in order to not block request processing, does as much computing as possible concurrently in a background thread. So, start a thread, set up some queue to push information, and we’re set. Schematically:
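Something like this minimal sketch (the names and event shape are illustrative, not the actual agent code):

```ruby
# A queue the instrumented request code pushes events onto (illustrative;
# this is a sketch, not the actual Sqreen agent code).
events = Queue.new
processed = []

# The background thread drains the queue so request handling never blocks.
reporter = Thread.new do
  while (event = events.pop) != :stop
    processed << event # stand-in for shipping the event to the backend
  end
end

events << { kind: :attack_attempt, path: "/login" } # from request code
events << :stop
reporter.join
```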
Sounds easy enough, right? The truth is, not so much.
Where things go sideways, part one
Once upon a time, the Sqreen Ruby agent was living its merry life, when one day a customer reached out to us on our support line. He was deploying to Heroku and his app refused to start. When he disabled the agent, everything was fine again. We pride ourselves on our reliability and we built the agent to be as non-intrusive to the app as possible. Having a customer application burst into flames definitely does not fit this definition. This was a terrifying worst case scenario for us.
Not to fret, all hands on deck: the team investigated without skipping a beat. The agent crashed, snarkily dumping core. We dug left and right into our code, but it seemed sound, even though we could reproduce the crash locally on the then shiny new Ruby 2.5.0 (but not on previous versions). Looking into the core dump file, we saw that the issue lay in the very queue that pushes data to the background thread. The source of the crash was in Ruby’s Queue management, and Ruby’s VM was blowing up, without an apparent legitimate reason, on code that had been working for a long time!
We reported the bug upstream, and a little back and forth got the issue fixed on trunk in less than 12 hours (props to the Ruby team!). Good? Sure, but not enough: the fix would only be available on the next Ruby release and even though we had no other customer reports, other customers would be at risk of a serious impact in the meantime. So we set out to find a workaround.
The core issue was that we created a Queue, which uses threads behind the scenes. Puma is one of the most used web servers in the Ruby ecosystem, and people increasingly started to set it up in cluster mode, a mode that was not available in Puma’s first days. Cluster mode is where the web server forks worker processes to handle requests in parallel, because while Puma can use threads, those only help with concurrent IO due to Ruby’s GVL. The trouble starts when you create a thread and then fork, as forking brutally kills all threads except the one calling fork. Ruby’s VM is supposed to take care of that for high-level constructs like queues, but had a cleanup-and-restart issue.
So what could we do quickly for versions without the fix? The plan was twofold: detect whether we’re on a forking server – Puma in cluster mode or Unicorn – and if so, set up a clean Queue and start the background thread only when the workers actually start.
Okay then, workaround pushed, reviewed, released; customers lived happily ever after. Done. End of story, and end of blog post. You can now return to your usual activities.
Oh, I sure wish it were so…
Where things go sideways, part deux
The fix seemed solid for a long time. But then we started receiving reports of strange behaviors from a few customers, the kind that seem to happen inconsistently, that you can neither reproduce nor reliably explain. Your typical heisenbugs (higgsbugs, even), so to speak. These inconsistent behaviors didn’t impact the protection status of the customers’ apps, but they continued to pop up.
Clearly, something was afoot.
It was not long after we started to sense a correlation between those haphazard reports that things blew up. The same customer as before, doing an unrelated minor update, experienced a segfault again, in an area that had lain untouched for years. And then a second customer, also on Heroku, provided us with a similar report, with the crash happening in yet another unrelated location.
No way we’re letting that one slip.
A quick analysis of the logs showed that our background thread was running not just in the web server workers, but also in the master process, in spite of the fail-safes we had put in place to prevent that. We’re big believers in additional fail-safes: even though the background thread should not have been running there, we had added code asserting that the thread-starting code should bail out in situations where it had no business running. It turns out that things had moved ever so slightly across releases of some of the app’s dependencies, and the conditions for the fail-safe we built no longer resolved properly. Even though we had needed a fix in short order for the original problem, we had aimed for extra robustness; unfortunately, even that was not enough.
We got to work, and fortunately, it didn’t take too long. For the first crash, we found a workaround through a feature flag that we could push instantly without impacting the features our customer actually used, and the second customer was not running in production mode.
With the customer’s production up and running again, we had a little bit of breathing room, so we decided to sit down and fix this, for good this time. That means we had to get down into the rabbit hole of the Ruby web server startup sequence.
Ready to dive in?
A Rack primer
Many web application developers use Rails directly to build their apps, and so are mostly unaware of Rack operating behind the scenes. Rack is a brilliantly simple design, the kind the Ruby community so often comes up with.
If you were to design an HTTP server using nothing but the standard library’s TCPServer, this is probably how, in some form or another, you’d implement handling a parsed HTTP request:
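Perhaps something like this (a sketch; the status/headers/body triple is exactly Rack’s convention):

```ruby
# Handle an already-parsed HTTP request and answer with plain Ruby data:
# a status code, a hash of headers, and an enumerable body.
def func1(request)
  [200, { "Content-Type" => "text/plain" }, ["Hello from #{request[:path]}"]]
end

response = func1(path: "/")
```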
And this is basically what Rack defines!
Now, you may want to separate concerns instead of having func1 do everything, so you’d probably wrap func1 with some func2 doing things before and/or after:
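For instance (illustrative headers and logging):

```ruby
def func1(request)
  [200, { "Content-Type" => "text/plain" }, ["Hello"]]
end

# func2 wraps func1, doing things before (logging) and after (adding a
# header). Note that the call to func1 is hardcoded inside it.
def func2(request)
  warn "handling #{request[:path]}"        # before
  status, headers, body = func1(request)   # hardcoded composition
  [status, headers.merge("X-Wrapped" => "true"), body] # after
end

response = func2(path: "/")
```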
func2 depends directly on func1: the composition is hardcoded. You can fix that by passing in the next step to be called:
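A sketch of that change (names illustrative):

```ruby
def func1(request)
  [200, { "Content-Type" => "text/plain" }, ["Hello"]]
end

# func2 no longer knows about func1: the next step is passed in.
def func2(request, next_step)
  status, headers, body = next_step.call(request)
  [status, headers.merge("X-Wrapped" => "true"), body]
end

response = func2({ path: "/" }, method(:func1))
```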
You then have a chain of functions to be called:
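With every step taking a request and the next step to call, the chain might be wired up like this (a sketch; func1 receives a stack argument too, for uniformity):

```ruby
def func1(request, _stack)
  [200, {}, ["Hello"]] # the application: ignores the next step
end

def func2(request, stack)
  status, headers, body = stack.call(request)
  [status, headers.merge("X-Func2" => "yes"), body]
end

def func3(request, stack)
  status, headers, body = stack.call(request)
  [status, headers.merge("X-Func3" => "yes"), body]
end

# The "first call": wire the chain up by hand, outermost first.
response = func3({ path: "/" },
                 ->(req) { func2(req, ->(r) { func1(r, nil) }) })
```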
Now those functions are entirely composable, as you can reorder them any way you see fit.
This “first call” is awfully similar to the boilerplate inside func2. Let’s up the ante:
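One way to cut that boilerplate is to fold a list of steps into a single callable (a sketch; the reduce-based wiring is illustrative):

```ruby
def func1(request, _stack)
  [200, {}, ["Hello"]]
end

def func2(request, stack)
  status, headers, body = stack.call(request)
  [status, headers.merge("X-Func2" => "yes"), body]
end

def func3(request, stack)
  status, headers, body = stack.call(request)
  [status, headers.merge("X-Func3" => "yes"), body]
end

# List the steps outermost-first and fold them into one callable,
# so that no function ever names another directly.
steps = [method(:func3), method(:func2), method(:func1)]
chain = steps.reverse.reduce(nil) do |stack, step|
  ->(req) { step.call(req, stack) }
end

response = chain.call(path: "/")
```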
You surely noticed that func1 is a bit special, as it ignores its stack argument: it’s the application itself, while the others are middlewares. Let’s make that more semantic, and abstract that unused argument away:
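For instance, with a small hypothetical helper:

```ruby
# An application only cares about the request; wrap it so it fits the
# uniform (request, stack) middleware shape (hypothetical helper).
def application(&handler)
  ->(request, _stack) { handler.call(request) }
end

app = application do |request|
  [200, {}, ["Hello from #{request[:path]}"]]
end

response = app.call({ path: "/" }, nil) # the stack argument is silently ignored
```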
Now we can rewrite our app with a higher-level semantic:
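For instance, the app might now be written purely as request-in, response-out (illustrative routing):

```ruby
# With the plumbing abstracted away, the app is just a function from
# request to response (routing shown is illustrative).
app = lambda do |request|
  case request[:path]
  when "/" then [200, { "Content-Type" => "text/plain" }, ["home"]]
  else          [404, { "Content-Type" => "text/plain" }, ["not found"]]
  end
end

response = app.call(path: "/nope")
```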
And finally, we are able to describe our stack:
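Something like this tiny Rack::Builder-style description (the Stack class and names are an illustrative sketch, not Rack’s actual code):

```ruby
# A tiny Rack::Builder-like stack description (a sketch, not Rack's code).
class Stack
  def initialize
    @middlewares = []
  end

  def use(middleware) # outermost first, as with Rack::Builder#use
    @middlewares << middleware
  end

  def run(app)        # the innermost application, as with Rack::Builder#run
    @app = app
  end

  def call(request)
    chain = @middlewares.reverse.reduce(@app) do |stack, middleware|
      ->(req) { middleware.call(req, stack) }
    end
    chain.call(request)
  end
end

logger = ->(req, stack) do
  status, headers, body = stack.call(req)
  [status, headers.merge("X-Logged" => "yes"), body]
end

stack = Stack.new
stack.use(logger)
stack.run(->(req) { [200, {}, ["Hello"]] })
response = stack.call(path: "/")
```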
Boom. We have the essence of Rack.
- Rack defines a protocol that isolates the web server implementation, whose role is to handle and parse HTTP requests into a well-defined Ruby data structure, from the application and middlewares, which return a well-defined Ruby data structure encoding the HTTP response.
- Notice how a Stack instance has the same interface as func1: this means that stacks are actually applications, and that we can nest them using run for additional composability!
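A condensed, illustrative Stack class (a sketch, not Rack’s implementation) makes the nesting concrete:

```ruby
# Condensed illustrative stack builder (a sketch, not Rack's code).
class Stack
  def initialize
    @middlewares = []
  end

  def use(middleware)
    @middlewares << middleware
  end

  def run(app)
    @app = app
  end

  def call(request)
    chain = @middlewares.reverse.reduce(@app) do |stack, middleware|
      ->(req) { middleware.call(req, stack) }
    end
    chain.call(request)
  end
end

inner = Stack.new
inner.run(->(req) { [200, {}, ["inner app"]] })

outer = Stack.new
outer.run(inner) # a Stack quacks like an application: it responds to #call
response = outer.call(path: "/")
```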
Modern Ruby web server design
Web servers such as Puma and Unicorn use a forking model, wherein workers are forked from a master process; this results in increased memory consumption if the application is loaded separately in each worker. To combat that, such servers offer an application preloading mode, which requires and initializes the application in the master process, before the fork happens. This yields a significant reduction in memory usage for the parts that stay untouched, thanks to the system’s copy-on-write memory pages.
Additionally, Puma can also run in single mode, where it uses threads instead of processes to ensure concurrency, at the cost of parallelism.
Those behaviors are configurable: while Unicorn is always forking by design, Puma is not. Either way, both can be configured to preload or not.
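For instance, a typical cluster-mode config/puma.rb might look like this (directives from Puma’s configuration DSL; the worker and thread counts are illustrative):

```ruby
# config/puma.rb -- cluster mode with preloading (counts are illustrative)
workers 2      # fork two worker processes (cluster mode)
threads 1, 5   # each worker also runs a thread pool for concurrent IO
preload_app!   # load the app once in the master, before forking

on_worker_boot do
  # runs in each worker after fork: the place to restart background
  # threads, reconnect databases, and so on
end
```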
Starting the Sqreen background thread
If you recall from earlier, our previous fix relied on various preexisting heuristics to detect whether we’re being forked after being loaded. Unfortunately, those heuristics made assumptions about the timeline of events and were therefore sensitive to changes within web servers. Well, those changes happened, and that is what triggered the mis-detection. So how did we fix it?
We hook on #to_app, whose role is to build the full stack of middlewares as the last step of Rack::Builder, after it reads Rack’s config.ru file (which contains the run directives). Instead of using heuristics, there surely was a way to obtain the result of Puma’s, Unicorn’s, or Passenger’s own configuration, to detect without a doubt:
- whether we preload
- whether we’re forking
- whether we are in the master process or in a worker process
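The shape of such a hook, sketched with Module#prepend on a stand-in Builder class (the module name and the flag are hypothetical; the real agent’s detection logic is far more involved):

```ruby
# Stand-in for Rack::Builder so the sketch runs without the rack gem.
class Builder
  def to_app
    ->(env) { [200, {}, ["app"]] } # the fully built middleware stack
  end
end

# Hypothetical hook: prepending lets our #to_app run around the original.
module AgentHook
  def to_app
    app = super     # by now the server's configuration is fully loaded
    $hook_ran = true # stand-in for inspecting config / deferring boot
    app
  end
end

Builder.prepend(AgentHook)
app = Builder.new.to_app
```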
We set to work implementing true access to the final configuration of the running web server, and use of the dedicated web server hooks to delegate the actual boot to the workers. The next step was testing: Unicorn from 1.0 to 5.5, Puma from 1.0 to 3.12, Ruby from 1.9.3 to 2.6, Rails from 3.2 to 6.0, each crossed with the fork/preload configurations. That makes for one hell of a test matrix, but finally, we got everything to green.
So, problem solved, right? Let’s release! But in spite of all our testing beforehand, as soon as we ran our full integration test suite on CI, we were hit by an unexpected red light.
To be continued: a journey into Ruby web server startup sequences
Unfortunately, despite everything looking great in our testing, this is not the end of the story. We have a lot more to dive into with Ruby web server startup sequences. Since it’s a long tale, we decided to split up the telling of it into two parts. Part two will be coming up soon!
The post Fixing a critical issue: a journey into Ruby web server startup sequences, part one appeared first on Sqreen Blog.