Your Circuit Breaker is Misconfigured

Circuit breakers are an incredibly powerful tool for making your application resilient to service failure. But they aren’t enough. Most people don’t know that a slightly misconfigured circuit is as bad as no circuit at all! Did you know that a change in 1 or 2 parameters can take your system from running smoothly to completely failing?

I’ll show you how to predict how your application will behave in times of failure and how to configure every parameter for your circuit breaker.

At Shopify, resilient fallbacks integrate into every part of the application. A fallback is a backup behavior which activates when a particular component or service is down. For example, when the Redis instance that stores Shopify’s sessions is down, the user doesn’t have to see an error. Instead, the problem is logged and the page renders with sessions soft disabled. This results in a much better customer experience. This behaviour is achievable in many cases. However, it’s not as simple as catching exceptions that are raised by a failing service.

Imagine Redis is down and every connection attempt is timing out. Each timeout is 2 seconds long. Response times will be incredibly slow, since requests are waiting for the service to time out. Additionally, during that time the request is doing nothing useful and keeps the thread busy.

Utilization, the percentage of a worker’s maximum available working capacity, increases indefinitely as the request queue builds up, resulting in a utilization graph like this:

Utilization during service outage

A worker which had a request processing rate of 5 requests per second now can only process half a request per second. That’s a tenfold decrease in throughput! With utilization this high, the service can be considered completely down. This is unacceptable for production level standards.
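The throughput drop follows directly from per-request latency. Here's a rough sketch of the arithmetic, assuming a worker that handles one request at a time and that every request blocks for the full 2-second timeout. The function name and the 0.2-second healthy latency are illustrative, derived from the 5 requests per second figure above:

```python
def requests_per_second(seconds_per_request: float) -> float:
    """Throughput of a worker that handles one request at a time."""
    return 1.0 / seconds_per_request


healthy = requests_per_second(0.2)   # 0.2 s per request, i.e. 5 requests/s
degraded = requests_per_second(2.0)  # every request waits out a 2 s timeout

print(healthy, degraded, healthy / degraded)
```

The ratio between the two rates is the tenfold decrease described above.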

Semian Circuit Breaker

At Shopify, this fallback utilization problem is solved by Semian Circuit Breaker, a circuit breaker implementation written by Shopify. The circuit breaker pattern is based on a simple observation: if a timeout is observed for a given service one or more times, it’s likely to continue to time out until that service recovers. Instead of hitting the timeout repeatedly, the resource is marked as dead and an exception is raised instantly on any call to it.

I’m looking at this from the configuration perspective of the Semian circuit breaker, but another notable circuit breaker library is Hystrix by Netflix. The core functionality of its circuit breaker is the same. However, it has fewer parameters available for tuning, which means, as you will learn below, it can completely lose its effectiveness for capacity preservation.

A circuit breaker can take the above utilization graph and turn it into something more stable.

Utilization during service outage with a circuit breaker

The utilization climbs for some time before the circuit breaker opens. Once open, the utilization stabilizes, so the user may only experience slight request delays, which is much better.

Semian Circuit Breaker Parameters

Configuring a circuit breaker isn’t a trivial task. It seems trivial because there are only a few parameters to tune:

  • name
  • error_threshold
  • error_timeout
  • half_open_resource_timeout
  • success_threshold

However, these parameters cannot just be assigned to arbitrary numbers or even best guesses without understanding how the system works in detail. Changes to any of these parameters can greatly affect the utilization of the worker during a service outage.

At the end, I’ll show you a configuration change that drops the utilization requirement from 263% to 4%. That’s the difference between a complete outage and a slight delay. But before I get to that, let’s dive into detail about what each parameter does and how it affects the circuit breaker.
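As a reference point for the sections that follow, here's a minimal sketch of a circuit breaker state machine driven by these parameters. It's a simplified illustration and not Semian's implementation: the class and method names are hypothetical, every exception counts as an error, and the clock is injectable so the example runs instantly.

```python
import time


class CircuitOpenError(Exception):
    """Raised instantly while the circuit is open."""


class CircuitBreaker:
    def __init__(self, error_threshold, error_timeout, success_threshold,
                 clock=time.monotonic):
        self.error_threshold = error_threshold
        self.error_timeout = error_timeout
        self.success_threshold = success_threshold
        self.clock = clock
        self.state = "closed"
        self.errors = []       # timestamps of recent errors
        self.successes = 0     # consecutive successes while half-open
        self.opened_at = None

    def call(self, func):
        if self.state == "open":
            if self.clock() - self.opened_at < self.error_timeout:
                raise CircuitOpenError("resource marked as dead")
            # error_timeout has elapsed: let a real request through.
            self.state = "half_open"
            self.successes = 0
        try:
            result = func()
        except Exception:
            self._record_error(self.clock())
            raise
        self._record_success()
        return result

    def _record_error(self, now):
        if self.state == "half_open":
            self._open(now)    # a single error re-opens the circuit
            return
        # Only errors inside the error_timeout window count.
        self.errors = [t for t in self.errors if now - t < self.error_timeout]
        self.errors.append(now)
        if len(self.errors) >= self.error_threshold:
            self._open(now)

    def _record_success(self):
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"
                self.errors = []

    def _open(self, now):
        self.state = "open"
        self.opened_at = now
        self.errors = []


# Usage with a fake clock: three timeouts in a row open the circuit.
now = [0.0]
breaker = CircuitBreaker(error_threshold=3, error_timeout=5.0,
                         success_threshold=2, clock=lambda: now[0])


def failing():
    raise TimeoutError("service timed out")


for _ in range(3):
    try:
        breaker.call(failing)
    except TimeoutError:
        pass

print(breaker.state)
```

The half-open timeout is left out on purpose: in this sketch the caller's function owns its own timeout, so half_open_resource_timeout would be applied inside `func` while the state is half-open.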

Name

The name identifies the resource being protected. Each name gets its own circuit breaker. Every different service type, such as MySQL, Redis, etc., should have its own unique name to ensure that excessive timeouts in a service only open the circuit for that service.

There is an additional aspect to consider here. The worker may be configured with multiple service instances for a single type. In certain environments, there can be dozens of Redis instances that a single worker can talk to.

We would never want a single Redis instance outage to cause all Redis connections to go down so we must give each instance a different name.

For this example, see the diagram below. We will model a total of 3 Redis instances. Each instance is given a name “redis_cache_#{instance_number}”.

3 Redis instances. Each instance is given a name “redis_cache_#{instance_number}”
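One way to picture this naming scheme is a registry with one circuit per instance name. The sketch below uses a plain dictionary with a placeholder state record. It only illustrates the idea and is not Semian's API, and the zero-based instance numbering is an assumption:

```python
# One independent circuit state per Redis instance, keyed by a unique name.
instance_count = 3
circuits = {f"redis_cache_{n}": {"state": "closed"} for n in range(instance_count)}

# An outage on one instance opens only that instance's circuit.
circuits["redis_cache_1"]["state"] = "open"

still_closed = [name for name, c in circuits.items() if c["state"] == "closed"]
print(still_closed)
```

If all three instances shared one name, the single open circuit would reject requests to the two healthy instances as well.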

You must understand how many services your worker can talk to. Each failing service will have an aggregating effect on the overall utilization. When going through the examples below, the maximum number of failing services you would like to account for is defined by failing_services. For example, if you have 3 Redis instances, but you only need to know the utilization when 2 of those go down, failing_services should be 2.

All examples and diagrams in this post are from the reference frame of a single worker. None of the circuit breaker state is shared across workers so we can simplify things this way.

Error Threshold

The error_threshold defines the number of errors to encounter within a given timespan before opening the circuit and starting to reject requests instantly. If the circuit is closed and error_threshold number of errors occur within a window of error_timeout, the circuit will open.

The larger the error_threshold, the longer the worker will be stuck waiting for input/output (I/O) before reaching the open state. The following diagram models a simple scenario where we have a single Redis instance failure.

error_threshold = 3, failing_services = 1

A single Redis instance failure

3 timeouts happen one after the other for the failing service instance. After the third, the circuit becomes open and all further requests raise instantly.

3 timeouts must occur during the timespan before the circuit becomes open. Simple enough, 3 times the timeout isn’t so bad. The utilization will spike, but the service will reach steady state soon after. This graph is a real world example of this spike at Shopify:

A real world example of a utilization spike at Shopify

The utilization begins to increase when the Redis services go down. After a few minutes, the circuit begins opening for each failing service and the utilization lowers to a steady state.

Furthermore, if there’s more than 1 failing service instance, the spike will be larger, last longer, and cause more delays for end users. Let’s come back to the example from the Name section with 3 separate Redis instances. Consider all 3 Redis instances being down. Suddenly the time until all circuits are open triples.

error_threshold = 3, failing_services = 3

3 failing services and each service has 3 timeouts before the circuit opens

There are 3 failing services and each service has 3 timeouts before the circuit opens. All the circuits must become open before the worker will stop being blocked by I/O.

Now, we have a longer time to reach steady state because each circuit breaker wastes utilization waiting for timeouts. Imagine 40 Redis instances instead of 3: with a timeout of 1 second and an error_threshold of 3, it takes a minimum of around 2 minutes (40 × 3 × 1 second) to open all the circuits.

This estimate is a minimum because the order in which requests come in cannot be guaranteed. The above diagram simplifies the scenario by assuming the requests come in a perfect order.
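The lower bound is a straightforward product. A small sketch, assuming a single-threaded worker that waits out the timeouts one after another (the function name is illustrative):

```python
def min_seconds_to_open_all(failing_services: int, error_threshold: int,
                            service_timeout: float) -> float:
    """Lower bound: each failing service must time out error_threshold times,
    and the worker waits for those timeouts one after another."""
    return failing_services * error_threshold * service_timeout


print(min_seconds_to_open_all(3, 3, 1.0))    # 9.0 seconds for 3 instances
print(min_seconds_to_open_all(40, 3, 1.0))   # 120.0 seconds, about 2 minutes
```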

To keep the initial utilization spike low, the error_threshold should be reduced as much as possible. However, the probability of false-positives must be considered. Blips can cause the circuit to open despite the service not being down. The lower the error_threshold, the higher the probability of a false-positive circuit open.

Assume the steady state timeout error rate is 0.1% in your time window of error_timeout. An error_threshold of 3 will then give you a 0.0000001% chance of getting a false positive.

$$100 \times (\text{probability\_of\_failure})^{\text{number\_of\_failures}} = 100 \times (0.001)^{3} = 0.0000001\%$$

You must balance this probability with your error_timeout to reduce the number of false-positive circuit opens. Once the circuit opens, it raises instantly for every request made during error_timeout.
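To see how quickly the false-positive probability falls as the threshold rises, here's a short sketch of the same calculation. It assumes errors are independent of each other, which real blips often aren't, so treat the results as optimistic. The function name is illustrative:

```python
def false_positive_percent(error_rate: float, error_threshold: int) -> float:
    """Percent chance that error_threshold independent requests all fail."""
    return 100 * error_rate ** error_threshold


# Each extra required error divides the false-positive chance by 1000
# at a 0.1% steady state error rate.
for threshold in (1, 2, 3):
    print(threshold, false_positive_percent(0.001, threshold))
```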

Error Timeout

The error_timeout is the amount of time until the circuit breaker will try to query the resource again. It also determines the period over which the error_threshold count is measured. The larger this value is, the longer the circuit will take to recover after an outage, and the longer a false-positive circuit open will affect the system.

error_threshold = 3, failing_services = 1

The circuit will stay open for error_timeout amount of time

After the failing service causes the circuit to become open, the circuit will stay open for error_timeout amount of time. Once the Redis instance comes back to life and error_timeout amount of time has passed, requests are sent to Redis again.

It’s important to consider the error_timeout in relation to half_open_resource_timeout. These 2 parameters are the most important for your configuration. Getting these right will determine the success of the circuit breaker’s resiliency mechanism in times of outage for your application.

Generally we want to minimize the error_timeout because the higher it is, the higher the recovery time. However, the primary constraints come from its interaction with half_open_resource_timeout. I’ll show you that increasing error_timeout actually preserves worker utilization.

Half Open Resource Timeout

The circuit is in half-open state when it’s checking to see if the service is back online. It does this by letting a real request through. The circuit becomes half-open after error_timeout amount of time has passed. When the operating service is completely down for an extended period of time, a steady-state behavior arises.

failing_services = 1

Circuit becomes half-open after error_timeout amount of time has passed

The error_timeout expires but the service is still down. The circuit becomes half-open and a request is sent to Redis, which times out. The circuit opens again and the process repeats as long as Redis remains down.

This flip flop between the open and half-open state is periodic which means we can deterministically predict how much time is wasted on timeouts.

By this point, you may already be speculating on how to adjust wasted utilization. The error_timeout can be increased to reduce the total time wasted in the half-open state! Awesome — but the higher it goes, the slower your application will be to recover. Furthermore, false positives will keep the circuit open for longer. Not good, especially if we have many service instances. 40 Redis instances with a timeout of 1 second is 40 seconds every cycle wasted on timeouts!

So how else do we minimize the time wasted on timeouts? The only other option is to reduce the service timeout. The lower the service timeout, the less time is wasted on waiting for timeouts. However, this cannot always be done. Adjusting this timeout is highly dependent on how long the service needs to provide the requested data. We have a fundamental problem here. We cannot reduce the service timeout because of application constraints and we cannot increase the error_timeout because the recovery time will be too slow.

Enter half_open_resource_timeout, the timeout for the resource when the circuit is in the half-open state. It gets used instead of the original timeout. Simple enough! Now, we have another tunable parameter to help adjust utilization. To reduce wasted utilization, error_timeout and half_open_resource_timeout can be tuned. The smaller half_open_resource_timeout is relative to error_timeout, the better the utilization will be.

If we have 3 failing services, our circuit diagram looks something like this:

failing_services = 3

A total of 3 timeouts before all the circuits are open

In the half-open state, each service has 1 timeout before the circuit opens. With 3 failing services, that’s a total of 3 timeouts before all the circuits are open. All the circuits must become open before the worker will stop being blocked by I/O.

Let’s solidify this example with the following timeout parameters:

error_timeout = 5 seconds
half_open_resource_timeout = 1 second

The total steady state period will be 8 seconds: 5 seconds spent doing useful work and 3 seconds (3 failing services × 1 second each) wasted waiting for I/O. That’s 37.5% of total utilization wasted on I/O.
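The same arithmetic as a small sketch, assuming a single-threaded worker and that every listed service is still down at each half-open check (the function name is illustrative):

```python
def wasted_fraction(failing_services: int, half_open_resource_timeout: float,
                    error_timeout: float) -> float:
    """Share of each open/half-open cycle spent blocked on timeouts."""
    wasted = failing_services * half_open_resource_timeout
    return wasted / (error_timeout + wasted)


print(wasted_fraction(3, 1.0, 5.0))  # 0.375, i.e. 3 of every 8 seconds
```

Shrinking half_open_resource_timeout or growing error_timeout both push this fraction down, which is the trade-off described above.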

Note: Hystrix does not have an equivalent parameter for half_open_resource_timeout which may make it impossible to tune a usable steady state for applications that have a high number of failing_services.

Success Threshold

The success_threshold is the number of consecutive successes required for the circuit to close again, that is, to start accepting all requests to the circuit.

The success_threshold impacts the behavior during outages which have an error rate of less than 100%. Imagine a resource error rate of 90%. With a success_threshold of 1, the circuit will flip flop between open and closed quite often: each half-open check has a 10% chance of closing the circuit when it shouldn’t. Flip flopping also adds strain on the system since the circuit must spend time on I/O to re-close.

Instead, if we increase the success_threshold to 3, the likelihood of a premature close becomes significantly lower. Now, 3 successes must happen in a row to close the circuit, reducing the chance of flip flop to 0.1% per cycle (0.1 × 0.1 × 0.1 = 0.001).
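The effect of success_threshold during a partial outage can be sketched as a simple probability, assuming each half-open request succeeds independently with the same success rate (the function name is illustrative):

```python
def chance_of_closing(success_rate: float, success_threshold: int) -> float:
    """Probability that success_threshold consecutive half-open requests
    succeed, assuming each request succeeds independently."""
    return success_rate ** success_threshold


# With a 90% error rate, only 10% of requests succeed.
for threshold in (1, 2, 3):
    print(threshold, f"{chance_of_closing(0.1, threshold):.4%}")
```

Each extra required success multiplies the chance of a premature close by the success rate, so the benefit grows quickly when the outage is severe and slowly when it is mild.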

Note: Hystrix does not have an equivalent parameter for success_threshold which may make it difficult to reduce the flip flopping in times of partial outage for certain applications.

Lowering Wasted Utilization

Each parameter affects wasted utilization in some way. Semian can easily be configured into a state where a service outage will consume more utilization than the capacity allows. To calculate the additional utilization required, I have put together an equation to model all of the parameters of the circuit breaker. Use it to plan for outages effectively.

The Circuit Breaker Equation

[Equation figure: The Circuit Breaker Equation]

This equation applies to the steady state failure scenario in the last diagram where the circuit is continuously checking the half-open state. Additional threads reduce the time spent on blocking I/O. However, the equation doesn’t account for the time it takes to context switch a thread, which could be significant depending on the application. The larger the context switch time, the lower the thread count should be.
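Written out, one form of the relationship that is consistent with the parameter definitions and with the two results quoted in the tuning example (263% and 4%) is shown below. Treat it as a reconstruction from those numbers rather than a verbatim copy of the original figure. The symbol thread_count is a name introduced here for the number of worker threads:

```latex
\text{additional utilization} =
  \frac{\text{failing\_services} \times \text{half\_open\_resource\_timeout}}
       {\text{error\_timeout} \times \text{thread\_count}}
```

The numerator is the time spent blocked on half-open timeouts in each cycle, and the denominator is the thread time available while the circuits are open. Because the ratio is measured against the open period rather than the whole cycle, it can exceed 100%.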

I ran a live test to validate the equation, and the utilization observed closely matched the utilization it predicted.

Tuning Your Circuit

Let’s run through an example and see how the parameters can be tuned to match the application needs. In this example, I’m integrating a circuit breaker for a Rails worker configured with 2 threads. We have 42 Redis instances, each configured with its own circuit and a service timeout of 0.25s.

As a starting point, let’s go with the following parameters. Failing instances is 42 because we are judging behaviour in the worst case, when all of the Redis instances are down.

Parameter                    Value
failing_instances            42
service_timeout              0.25 seconds
error_threshold              3
error_timeout                2 seconds
success_threshold            2
half_open_resource_timeout   0.25 seconds (same as service timeout)

Plugging into The Circuit Breaker Equation, we require an extra utilization of 263%. Unacceptable! Ideally we should have something less than 30% to account for regular traffic variation.

So what do we change to drop this number?

From production observation metrics, I know 99% of Redis requests have a response time of less than 50ms. With a value this low, we can easily drop the half_open_resource_timeout to 50ms and still be confident that the circuit will close when Redis comes back up from an outage. Additionally, we can increase the error_timeout to 30 seconds. This means a slower recovery time but it reduces the worst case utilization.

With these new numbers, the additional utilization required drops to 4%!
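Both figures can be reproduced with a few lines of arithmetic. The sketch below assumes the equation has the form "time blocked on half-open timeouts per cycle, divided by error_timeout times the thread count". That form is an assumption chosen because it matches both quoted results, and the function name is illustrative:

```python
def additional_utilization(failing_services: int,
                           half_open_resource_timeout: float,
                           error_timeout: float,
                           thread_count: int) -> float:
    """Extra utilization needed in steady state, as a fraction (1.0 == 100%).
    Assumed form; it ignores thread context-switch time."""
    blocked = failing_services * half_open_resource_timeout
    return blocked / (error_timeout * thread_count)


# Starting point: half-open timeout equals the 0.25 s service timeout.
before = additional_utilization(42, 0.25, 2.0, 2)
# Tuned: 50 ms half-open timeout and a 30 s error_timeout.
after = additional_utilization(42, 0.05, 30.0, 2)

print(f"{before:.1%} -> {after:.1%}")
```

Under this assumption the results are roughly 262.5% and 3.5%, which round to the 263% and 4% quoted above. Most of the improvement comes from the longer error_timeout, which is why the slower recovery time is worth weighing carefully.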

I use this equation as something concrete to relate back to when making tuning decisions. I hope it helps you with your circuit breaker configuration as it has with mine.

Author’s Edit: “I fixed an error with the original circuit breaker equation in this post. success_threshold does not have an impact on the steady state utilization because it only takes 1 error to keep the circuit open again.”

Source: Shopify