Access patterns dictate how data should be stored: frequently written data belongs in a row-oriented format, while frequently queried data belongs in a column-oriented format.
Mixpanel’s in-house database, Arb, supports both real-time event ingestion and fast analytical queries over all history. Storage nodes read data from a queue and write it to disk using a row-oriented write-ahead log (WAL).
We run a service called compacter which converts data from the row format to a columnar format. This service runs in an autoscaling nodepool on Google Kubernetes Engine (GKE). When a storage node has a row file of a certain size or age, it sends a request to a random compacter node to compact it. The compacter returns either a handle to the resulting columnar file or an error. If all compacters are busy, they shed load and storage nodes retry with some backoff.
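The shed-and-retry loop between storage nodes and compacters can be sketched as follows. This is an illustrative reconstruction, not Arb's actual code: `OverloadedError`, `compact_with_retry`, and the jittered backoff parameters are all assumptions.

```python
import random
import time

class OverloadedError(Exception):
    """Raised when a compacter sheds load because it is busy."""

def compact_with_retry(compacters, compact, max_attempts=5, base_delay=0.5):
    """Pick a random compacter and request compaction; if it sheds load,
    retry a (possibly different) random node with jittered exponential
    backoff, as the storage nodes do."""
    for attempt in range(max_attempts):
        node = random.choice(compacters)
        try:
            return compact(node)  # handle to the resulting columnar file
        except OverloadedError:
            # Full jitter: sleep a random duration up to the exponential cap.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise RuntimeError("all compacters busy; giving up")
```

The jitter matters: if every storage node retried on the same fixed schedule, retries would arrive in synchronized waves and keep tripping the load shedder.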
While the compacter runs outside the critical path of queries, it must be reliable and efficient, since all new data in Arb passes through it daily. Given the nature of its work, the compacter service is computationally expensive.
Compacter costs grew steadily in the latter half of 2019, in part due to our growing customer base and in part due to new use cases for data transformation.
Worse yet, the compacter failed to autoscale at the right times to absorb spikes in load. As a result, engineers manually raised the autoscaler’s minimum replica count whenever load was high or during a service-wide incident. This cost us both engineering time and money on over-provisioned instances.
A Skew Problem
We first looked at utilization, which, for a compute-bound autoscaling service, we expected to be 80–90%. This assumption seemed reasonable given how Kubernetes’ Horizontal Pod Autoscaler (HPA) works: you configure it with a target service (compacter) and a target metric for that service (e.g., average CPU utilization of 90%). If CPU utilization exceeds the target by some threshold, it schedules more pods, and vice versa if utilization is lower. Some hysteresis is built in to avoid constant thrashing.
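The HPA's core decision reduces to a simple ratio. The real controller layers a tolerance band and stabilization windows on top of this, but the essence is:

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float) -> int:
    # Kubernetes HPA core formula:
    #   desired = ceil(current * currentMetric / targetMetric)
    # (the real controller also applies a tolerance threshold and
    # hysteresis before acting on this number)
    return math.ceil(current_replicas * current_utilization / target_utilization)

# 10 pods averaging 40% CPU against a 90% target -> scale down to 5
print(desired_replicas(10, 0.40, 0.90))
```

Note that the `currentMetric` here is an *average* across pods, which is exactly why skew in per-pod load undermines it.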
In our case, however, average CPU utilization was ~40%. Averages can hide skew, so we plotted the median and 90th percentile utilization. This showed that half of the compacters did little work, while the top 10% were maxed out! Aggregate utilization was low because the autoscaling algorithm uses averages, not percentiles, to make its scaling decisions:
At first, we found this skew surprising:
- Storage nodes send requests to compacters based on a uniformly random distribution
- Compacters receive thousands of requests per minute, so by the law of large numbers, these requests should have balanced out over the 30–40 nodes we had.
It turns out that even if you distribute requests uniformly at random, skew can occur when the individual load items vary wildly in size. That is the case for us because of the vast range of customers we serve, from startups sending thousands of events per day to large companies sending billions. This kind of power-law distribution, where your largest users are orders of magnitude larger than your smallest, makes simple averages far less useful.
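To make this concrete, here is a small simulation (an illustrative Pareto distribution, not our real traffic) showing that uniformly random assignment of heavy-tailed work items still produces skewed per-node load:

```python
import random

random.seed(42)

NODES = 40
# Heavy-tailed job sizes: a handful of jobs are far larger than the
# typical one (Pareto-distributed, purely illustrative).
jobs = [random.paretovariate(1.2) for _ in range(10_000)]

load = [0.0] * NODES
for job in jobs:
    load[random.randrange(NODES)] += job  # uniform random assignment

load.sort()
mean = sum(load) / NODES
p50 = load[NODES // 2]
p90 = load[int(NODES * 0.9)]
print(f"mean={mean:.0f}  p50={p50:.0f}  p90={p90:.0f}  max={load[-1]:.0f}")
```

Running this, the top nodes carry a multiple of the median node's load even though every node received roughly the same *number* of jobs: the count balances out, the bytes do not.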
Exploring the Solution Space
Once we identified skew as the issue, we considered a few solutions of varying complexity:
- (High) Use a queue: Insert a queue between our storage nodes (clients) and compacters. This switches from the current push-model to an asynchronous pull-model, where compacters only take work when they have capacity. While this leads to optimal utilization, it requires a fundamental architecture change, adds a queue component, and requires keeping track of what compaction work is in-flight.
- (Medium) Use a proxy: Insert a load balancing service between storage nodes and compacters. This service can keep track of the load of all compacters, and forward incoming requests to the least loaded compacter. This maintains our current push-based architecture, but adds a proxy in between. We considered this, but decided to validate our hypothesis with something simpler.
- (Low) Power of 2 Random Choices (P2C): The idea here is simple: instead of storage nodes randomly picking one compacter, they randomly pick two, ask each for its current load, and send the request to the less loaded of the two. In theory, this comes within a constant factor of the two solutions above. We also wrote a quick Python simulation to confirm our intuition about how this works.
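A quick simulation along these lines might look like the following (a reconstruction under assumed parameters, not the original script). It assigns the same heavy-tailed workload to 40 nodes twice, once uniformly at random and once with P2C, and compares the max/mean load ratio:

```python
import random

def simulate(num_nodes: int, jobs: list, use_p2c: bool) -> list:
    """Assign each job to a node and return the per-node load totals."""
    load = [0.0] * num_nodes
    for job in jobs:
        if use_p2c:
            # Power of 2 Random Choices: probe two distinct random
            # nodes and send the job to the less loaded one.
            a, b = random.sample(range(num_nodes), 2)
            target = a if load[a] <= load[b] else b
        else:
            target = random.randrange(num_nodes)
        load[target] += job
    return load

random.seed(0)
# Heavy-tailed job sizes (Pareto), standing in for customer skew.
jobs = [random.paretovariate(2) for _ in range(10_000)]

ratios = {}
for name, p2c in (("random", False), ("p2c", True)):
    load = simulate(40, jobs, p2c)
    ratios[name] = max(load) / (sum(load) / len(load))  # max/mean ratio
    print(f"{name:>6}: max/mean = {ratios[name]:.2f}")
```

P2C's max/mean ratio comes out close to 1, while uniform random assignment leaves the busiest node well above the mean, which is the skew we were seeing in production.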
Power of 2 Random Choices FTW!
We started with the P2C approach because we could implement it in less than a day. And we were not disappointed. Here is the median and 90th percentile load across our compacter nodes before and after implementing Power of 2 balancing.
With the above improvement to balancing, and a few other smaller tweaks, we could reliably use average utilization to autoscale and saw compacters react predictably to load spikes. In turn, we saw both our average utilization increase to ~90% and our error rate drop to nearly 0 in steady state since we no longer needed to shed load.
Cost-Efficient Autoscaling with Preemptible VMs
Given our increased confidence in the system’s behavior, we decided to take on a more advanced autoscaling strategy using preemptible VMs.
Preemptible VMs are low-cost VMs offered by Google that are ideal for stateless, background workloads. At roughly one-third the cost of regular VMs, they come with a reliability trade-off: Google can terminate them at any time when it needs the capacity back.
We previously ran the compacter exclusively on preemptible VMs. This was fine in the happy case, but catastrophic when Google terminated them. So we switched entirely to regular VMs, which improved reliability but was more expensive.
This time, though, we wanted the best of both worlds: what if we could get the cost benefits of preemptible VMs in the common case, but with the reliability benefits of regular VMs in the worst case?
Turns out, we could! After some research and finding this excellent article, we configured our nodepools so that Kubernetes automatically tries to schedule onto the cheaper preemptible VMs when they are available and falls back to the more expensive regular VMs otherwise.
We shipped both of these optimizations over the course of 2 weeks in January 2020, and they dropped our compacter service costs nearly 70%. Importantly, these stood the test of time with minimal follow-up from the team.
We learned a few lessons along the way that we hope are useful to teams building high scale services in the cloud:
Autoscaling is hard to get right! It requires a high-signal target metric that is resilient to skew. In some cases, you can front your autoscaling service with a queue and autoscale on queue length. However, when callers block waiting for the result or an error, introducing a queue can add substantial complexity to the system.
Be suspicious of averages. Always look at percentiles or the full distribution to better understand your load. Because of power-law effects, a few requests may constitute a significant share of your overall load.
P2C is a remarkably simple trick to address skew. In subsequent months, we applied it to a few other compute-intensive autoscaling services at Mixpanel with similar success. That said, it’s not without tradeoffs. Here are two caveats to consider before using P2C:
- It requires all clients to cooperate. If your service has just one client that you control, you can ensure that the client checks load first and always sends requests to the less-loaded node. If you have many clients, or clients outside your control, this doesn’t work; if even one client ignores the P2C protocol, all bets are off in terms of skew. In our case, the compacter now has a small number of internal clients, all of which use a client library that encapsulates the P2C logic and other conveniences. If clients are truly external, it’s better to use an off-the-shelf proxy/load balancer to route requests (which may itself use P2C under the hood!).
- It introduces two extra round trips on each request to check the load on two random nodes. This overhead is fine when the actual requests are relatively heavy (as in our case) and client-server latency is negligible (e.g., internal services in a single GCP zone). It’s not fine when the requests themselves are small or the round-trip latency between client and server is high.
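With those caveats in mind, the client-side logic is only a few lines. This is a hypothetical sketch, not our actual client library: the backend list, the `get_load` probe, and `send` are stand-ins for whatever RPC layer you use.

```python
import random

class P2CClient:
    """Sketch of a power-of-2-choices client. `get_load` and `send`
    are hypothetical stand-ins for the real RPC calls."""

    def __init__(self, backends, get_load, send):
        self.backends = backends   # addresses of backend nodes
        self.get_load = get_load   # probe a node's current load (extra RTT)
        self.send = send           # send the actual request

    def request(self, payload):
        # Probe two distinct random backends (the two extra round trips),
        # then send the request to whichever reports less load.
        a, b = random.sample(self.backends, 2)
        target = a if self.get_load(a) <= self.get_load(b) else b
        return self.send(target, payload)
```

Packaging this in a shared library is what keeps all internal clients honest: no caller can accidentally fall back to picking a single random node.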
If you’re interested in creative solutions to challenging infrastructure problems, we’re hiring! Reach out to us via our careers page.