Breaking up Big Fred

or, the trouble with the monolith

🐧 What is Fred?

One of Medium’s most well-used and beloved internal services is Fred Penguin, our unofficial mascot and homegrown Slack bot.

Fred Penguin: Everyone’s helper, everyone’s friend.

Every day, Medians use Slack commands like fred helpdesk my laptop is broken! to file support tickets with IT or fred hi5 @jean to publicly celebrate or thank their teammates. For years, Fred was not owned by any single team, with engineers across the company contributing features or bugfixes whenever they found the time.

Over the years, Fred’s responsibilities have grown to include a great many tasks. Some of these are essential to many Medians’ daily workflows, like notifying us when a teammate requests a code review on GitHub or comments on a Jira ticket we’re working on.

As a monolith, Fred insists on handling everything by itself rather than delegating some of its tasks to other services. If any one of the umpteen things Fred is trying to do doesn’t go as planned, it can take Fred down completely, disrupting work for many people at Medium.

💀 The mystery of Fred’s death

It was November 2, 2020, and Fred was dead.

A graph showing a drastic increase in Fred restarts at the beginning of November
Fred restarts, Oct 27-Nov 2

To be more precise, Fred was ceasing to respond to any requests. When even health check requests were failing (Are you there, Fred? It’s me, Kubernetes), Kubernetes would restart Fred. This was happening over and over until Fred got stuck in a crash loop.
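As a rough illustration of that restart loop, here is a minimal sketch in Python. It is a toy simulation, not anything Kubernetes or Fred actually runs: `count_restarts` is a hypothetical helper, and the threshold of three consecutive failed checks is an assumption that mirrors the Kubernetes default `failureThreshold` for liveness probes.

```python
def count_restarts(probe_results, failure_threshold=3):
    """Count restarts for a sequence of health check outcomes.

    probe_results: iterable of booleans, True meaning "Fred answered".
    failure_threshold: consecutive failures tolerated before a restart
    (assumed value, matching the Kubernetes default of 3).
    """
    restarts = 0
    consecutive_failures = 0
    for healthy in probe_results:
        if healthy:
            # Any successful check resets the failure streak.
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            restarts += 1
            # The restarted container begins with a clean streak.
            consecutive_failures = 0
    return restarts


# A Fred that never answers is restarted again and again.
print(count_restarts([False] * 6))  # 2
```

A single hiccup is forgiven in this model; only an unbroken run of failed checks triggers a restart, which is why a Fred that never recovers ends up being restarted repeatedly.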

When a service breaks, one of the first places we look is its logs. We’re a detective investigating a murder case, and if we’re lucky, the victim was a prolific Twitter user whose last tweet reads, “So I’m in the library and Professor Plum just walked in with a giant pipe in hand and a murderous glint in his eye???”

But when we looked in Fred’s logs, we found that it was dying silently, leaving nary a clue behind. As a result, it was difficult to determine which of Fred’s many parts was causing its untimely and repeated demise.

To make things worse, we were only running one instance of Fred in production. We generally run multiple instances of most of our services, automatically scaling up or down based on our needs in a given moment. For a service as highly trafficked as the backend, relying on one instance would be like a busy restaurant relying on a single employee to greet, serve, cook, bartend, bus, and wash dishes.

An animated GIF from the video game Overcooked, with four chefs frantically running around a kitchen that is on fire
Did I mention that the restaurant serves approximately 5 million people each day?

Even if it’s not strictly necessary in order to handle the quantity of requests a service receives, running multiple instances of a service increases its reliability, or the likelihood that the service will be available when we need it to be. It’s the infrastructure equivalent of not putting all of our eggs in one basket. Fred1’s stuck in a crash loop? No worries, Fred2’s got us covered!
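To put a number on the eggs-and-baskets intuition, here is a small sketch in Python. It rests on a big simplifying assumption: instances fail independently of one another, which would not hold if the same bug took down every instance at once. The function name and the 99% figure are made up for illustration.

```python
def chance_at_least_one_up(per_instance_availability, instances):
    """Probability that at least one instance is available.

    Assumes each instance is up with the same probability and that
    failures are independent (a simplification).
    """
    if instances < 1:
        raise ValueError("need at least one instance")
    chance_all_down = (1 - per_instance_availability) ** instances
    return 1 - chance_all_down


print(round(chance_at_least_one_up(0.99, 1), 6))  # 0.99
print(round(chance_at_least_one_up(0.99, 2), 6))  # 0.9999
```

Under those assumptions, going from one instance to two turns one failure in a hundred into one in ten thousand.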

A man is mowing the lawn outside. Woman A: “Elizabeth! I thought Bob died last year.” Woman B: “I had him cloned.”
That Bob, always so reliable

Unfortunately, we couldn’t run more than one instance of Fred. The problem boiled down to two of Fred’s features: processing Slack events and executing cron jobs. Slack events are the triggers in Slack that cause Fred to respond (i.e., any of Fred’s commands), and cron jobs are code that is executed on a given schedule. Each Slack event and cron job should be handled exactly once, but with multiple Freds running, every Fred would handle the same Slack event or cron job, resulting in duplicate messages.
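The duplicate problem can be sketched in a few lines of Python. This is a toy model, not Fred’s real code: the `Fred` class and `deliver_to_all` function are hypothetical, and the model assumes that every running instance sees every event, which is what the paragraph above describes.

```python
class Fred:
    """A stand-in for one running instance of Fred."""

    def __init__(self, name):
        self.name = name

    def on_event(self, event):
        # Each instance reacts on its own; it has no idea whether
        # another instance has already handled the same event.
        return f"[{self.name}] handled: {event}"


def deliver_to_all(event, instances):
    """Assumed delivery model: every running instance sees every event."""
    return [fred.on_event(event) for fred in instances]


one_fred = deliver_to_all("fred hi5 @jean", [Fred("Fred1")])
two_freds = deliver_to_all("fred hi5 @jean", [Fred("Fred1"), Fred("Fred2")])

print(len(one_fred))   # 1 message: handled exactly once
print(len(two_freds))  # 2 messages: @jean gets high-fived twice
```

The same reasoning applies to cron jobs: if each instance runs its own scheduler, each one fires the job at the scheduled time.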

🔑 Unlocking the Fred multiverse

To get around this problem, we decided to split Fred into different versions, with each version responsible for a subset of Fred’s tasks. This meant selectively turning certain features on or off in each version of Fred.

Today, we have four different versions of Fred running in production: one for handling Slack events and cron jobs, one for handling GitHub integration, one for handling Jira integration, and one for handling all other HTTP requests.
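One way to picture “turning features on or off in each version” is a lookup table from version to enabled features. The sketch below is in Python and is purely illustrative: the feature names, the table, and the helper functions are all hypothetical, and treating “solo” as the version that owns Slack events and cron jobs is an inference from the diagram’s labels, not something Fred’s code is known to do.

```python
# Hypothetical mapping of each Fred version to the features it runs.
FEATURES_BY_VERSION = {
    "solo": {"slack_events", "cron_jobs"},
    "github": {"github_integration"},
    "jira": {"jira_integration"},
    "http": {"other_http_requests"},
}


def is_enabled(version, feature):
    """Return True if the given version of Fred should run the feature."""
    return feature in FEATURES_BY_VERSION[version]


def owners(feature):
    """List the versions that run a feature, to check nothing is shared."""
    return sorted(
        version
        for version, features in FEATURES_BY_VERSION.items()
        if feature in features
    )


print(is_enabled("jira", "jira_integration"))  # True
print(is_enabled("jira", "cron_jobs"))         # False
print(owners("slack_events"))                  # ['solo']
```

Because every feature belongs to exactly one version in this table, a crash in one version leaves the features of the others untouched.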

A diagram illustrating Fred as a single monolith application vs. “Multifred,” a service made up of GitHub Fred, Jira Fred, HTTP Fred, and Solo Fred
Past vs. present

The version of Fred that handles Slack events and cron jobs can still exist only once. However, the other versions of Fred are free to multiply. We currently run two instances each of GitHub Fred, Jira Fred, and HTTP Fred, so our total comes out to seven instances of Fred running in production today.
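The instance count is simple arithmetic, sketched here in Python with the single-instance rule written down as a check. The `REPLICAS` table and the `total_instances` helper are hypothetical names for illustration; only the numbers come from the paragraph above.

```python
# Instances per version, as described above. "solo" is the version that
# handles Slack events and cron jobs and must never be duplicated.
REPLICAS = {"solo": 1, "github": 2, "jira": 2, "http": 2}


def total_instances(replicas):
    """Sum the instances, refusing any setup with more than one solo Fred."""
    if replicas.get("solo") != 1:
        raise ValueError("the Slack events and cron jobs Fred must run exactly once")
    return sum(replicas.values())


print(total_instances(REPLICAS))  # 7
```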

🌅 From monolith to microservices

We took the decomposition of Fred one step further by creating a new microservice to handle one small function of Fred: managing Medians’ identity data related to Slack, GitHub, and Jira.

A diagram illustrating smaller Fred applications receiving data from a suite of future microservices
The Ghosts of Microservices Future 👻

In the future, we plan to continue breaking out some pieces of Fred into their own services.

✌️ Conclusion

We don’t know exactly why Fred was getting stuck in crash loops in November. But today, Fred’s future is much brighter. First, having entrusted our proverbial eggs to many different Fred-baskets, we’re much less likely to experience a total Fred outage again. If something goes awry with our Jira integration, our GitHub notifications will continue to work, and vice versa. Second, if one version of Fred keeps going down while the others are fine, we’ll have a better idea of where to start investigating the problem. Third, my teammate Bob Corsaro created SLOs for Fred, so in the future it will be easier for us to detect when Fred is running into trouble.

A photo of the mysterious Utah monolith
The Utah monolith: gone, but not forgotten

The issues we faced with Fred illuminated quintessential pain points of monolithic architecture. When so many distinct functions are handled by a single service with code that is closely intertwined, the service can become brittle and difficult to debug. While a microservice architecture comes with its own tradeoffs, decomposing a monolith by creating boundaries between discrete responsibilities can be an effective way to reduce and isolate the risk of each of those parts.

Breaking up Big Fred was originally published in Medium Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source: Medium