or, the trouble with the monolith
🐧 What is Fred?
One of Medium’s most widely used and beloved internal services is Fred Penguin, our unofficial mascot and homegrown Slack bot.
Every day, Medians use Slack commands like fred helpdesk my laptop is broken! to file support tickets with IT or fred hi5 @jean to publicly celebrate or thank their teammates. For years, Fred was not owned by any single team, with engineers across the company contributing features or bugfixes whenever they found the time.
Over the years, Fred’s responsibilities have grown to include a great many tasks. Some of these are essential to many Medians’ daily workflows, like notifying us when a teammate requests a code review on Github or comments on a Jira ticket we’re working on.
As a monolith, rather than delegating some of its tasks to other services, Fred insists on handling everything by itself. If one of the umpteen things Fred is trying to do doesn’t go as planned, it can take down Fred completely, disrupting work at Medium for many.
💀 The mystery of Fred’s death
It was November 2, 2020, and Fred was dead.
To be more precise, Fred had stopped responding to any requests. When even health check requests failed (Are you there, Fred? It’s me, Kubernetes), Kubernetes would restart Fred. This happened over and over, until Fred got stuck in a crash loop.
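For readers unfamiliar with the mechanism: a Kubernetes liveness probe is just an endpoint the kubelet calls periodically, restarting the container after enough consecutive failures. Here’s a minimal sketch of such an endpoint in Python — the handler names are hypothetical, not Fred’s actual code, which the post doesn’t show.

```python
# A liveness probe endpoint: returns 200 while the process can still
# answer requests. Kubernetes polls it and restarts the container after
# repeated failures -- which, in a crash loop, happens again and again.
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
import urllib.request

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet

# Spin up the server on an ephemeral port and probe it once, the way
# the kubelet would.
server = HTTPServer(("127.0.0.1", 0), HealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
with urllib.request.urlopen(f"http://127.0.0.1:{port}/health") as resp:
    print(resp.status)  # 200 while the service is alive
server.shutdown()
```

When the whole process hangs or dies — as Fred’s did — even this trivial endpoint stops answering, and Kubernetes’ only remedy is another restart.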
When a service breaks, one of the first places we look is its logs. We’re a detective investigating a murder case, and if we’re lucky, the victim was a prolific Twitter user whose last tweet reads, “So I’m in the library and Professor Plum just walked in with a giant pipe in hand and a murderous glint in his eye???”
But when we looked in Fred’s logs, we found that it was dying silently, leaving nary a clue behind. As a result, it was difficult to determine which of Fred’s many parts was causing its untimely and repeated demise.
To make things worse, we were only running one instance of Fred in production. We generally run multiple instances of most of our services, automatically scaling up or down based on our needs in a given moment. For a service as highly trafficked as the Medium.com backend, relying on one instance would be like a busy restaurant relying on a single employee to greet, serve, cook, bartend, bus, and wash dishes.
Even if it’s not strictly necessary in order to handle the quantity of requests a service receives, running multiple instances of a service increases its reliability, or the likelihood that the service will be available when we need it to be. It’s the infrastructure equivalent of not putting all of our eggs in one basket. Fred1’s stuck in a crash loop? No worries, Fred2’s got us covered!
Unfortunately, we couldn’t run more than one instance of Fred. The problem boiled down to two of Fred’s features: processing Slack events and executing cron jobs. Slack events are the Slack triggers that cause Fred to respond (e.g., any of Fred’s commands), and cron jobs are code executed on a given schedule. Each Slack event and cron job should be handled exactly once, but with multiple Freds running, the same Slack event or cron job would be handled once by each Fred, resulting in duplicate messages.
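To make the duplication concrete: if two instances both receive the same Slack event and neither knows about the other, both reply. The standard fix is an atomic “claim” on the event ID in shared storage so that exactly one instance wins. The sketch below simulates that with an in-process lock; this is a hypothetical illustration of the general technique, not how Fred works — in production the claim would live in something like Redis or a database, not process memory.

```python
# Two "instances" of a bot receive the same Slack event. An atomic
# claim on the event ID ensures only one of them responds.
import threading

claimed: set[str] = set()
lock = threading.Lock()

def claim(event_id: str) -> bool:
    """Atomically claim an event; returns True for exactly one caller."""
    with lock:
        if event_id in claimed:
            return False
        claimed.add(event_id)
        return True

replies: list[str] = []

def handle(instance: str, event_id: str) -> None:
    if claim(event_id):  # only the first claimant responds
        replies.append(f"{instance} handled {event_id}")

# Both instances see the same event, but only one reply is sent.
for instance in ("fred-1", "fred-2"):
    handle(instance, "slack-evt-42")
print(replies)  # exactly one entry
```

Without a shared store to coordinate through (or the engineering time to add one), the simpler guarantee is the one Fred relied on: run exactly one instance of the event-handling code.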
🔑 Unlocking the Fred multiverse
To get around this problem, we decided to split Fred into different versions, with each version responsible for a subset of Fred’s tasks. This meant selectively turning certain features on or off in each version of Fred.
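One common way to get this kind of selective toggling is per-deployment feature flags read from the environment, so the same codebase can be deployed as different “versions.” The sketch below is an assumption about how such flags might look — the variable name `FRED_FEATURES` and the feature names are hypothetical, not taken from Fred’s code.

```python
# Per-deployment feature flags: each deployed version of the bot sets
# an environment variable listing the features it should handle.
ALL_FEATURES = {"slack_events", "cron", "github", "jira", "http"}

def enabled_features(env: dict[str, str]) -> set[str]:
    """Parse a comma-separated flag list, ignoring unknown names."""
    raw = env.get("FRED_FEATURES", "")
    requested = {f.strip() for f in raw.split(",") if f.strip()}
    return requested & ALL_FEATURES

# e.g. the deployment that handles only the Github integration:
print(enabled_features({"FRED_FEATURES": "github"}))  # {'github'}
```

Each deployment then registers only the handlers for its enabled features at startup, so a bug in one feature can’t take down a version that never loads it.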
Today, we have 4 different versions of Fred running in production: one for handling Slack events and cron jobs, one for handling Github integration, one for handling Jira integration, and one for handling all other HTTP requests.
The version of Fred that handles Slack events and cron jobs must still run as a single instance. However, the other versions of Fred are free to multiply. We currently run two instances each of Github Fred, Jira Fred, and HTTP Fred, so our total comes out to 7 instances of Fred running in production today.
🌅 From monolith to microservices
We took the decomposition of Fred one step further by creating a new microservice to handle one small function of Fred: managing Medians’ identity data related to Slack, Github, and Jira.
In the future, we plan to continue breaking out some pieces of Fred into their own services.
We don’t know exactly why Fred was getting stuck in crash loops in November. But today, Fred’s future is much brighter. First, having entrusted our proverbial eggs to many different Fred-baskets, we’re much less likely to experience a total Fred outage again. If something goes awry with our Jira integration, our Github notifications will continue to work, and vice versa. Second, if one version of Fred keeps going down while the others are fine, we’ll have a better idea of where to start investigating the problem. Third, my teammate Bob Corsaro created SLOs for Fred, so in the future it will be easier for us to detect when Fred is running into trouble.
The issues we faced with Fred illuminated quintessential pain points of monolithic architecture. When so many distinct functions are handled by a single service with code that is closely intertwined, the service can become brittle and difficult to debug. While a microservice architecture comes with its own tradeoffs, decomposing a monolith by creating boundaries between discrete responsibilities can be an effective way to reduce and isolate the risk of each of those parts.