We Don’t Get Bitter, We Get Better

How we brought site reliability engineering standards and thinking to The New York Times technology teams.

Illustration by Jake Terrell

By Tim Burke

When The New York Times launched its first website in 1996, it had already been using software to send the layout of the next day’s paper to the printing presses. The website introduced some basic web technology into The Times’s portfolio. Twenty-five years later, that portfolio has grown to include systems that deliver articles, documentaries, newsletters, games and recipes, both online and in print, to millions of readers every day.

It’s a technology landscape that is diverse and complicated. Our systems use both bleeding-edge and legacy technologies that are deployed to multiple clouds and data centers all over the world.

Because of the breadth of Times products, our engineering teams have the freedom to make their own decisions about their technology stacks and the architecture of their systems. As a result, some of The Times’s systems have been around since before Amazon Web Services had a web console, and other systems use the latest tools from Google Cloud Platform. With such a broad portfolio comes a range of operational thinking and preparedness across our engineering teams.

Over the past couple of years, the Operations Engineering team (formerly Site Reliability Engineering) has worked to foster a culture of operational maturity that allows our engineers to efficiently build and maintain our technology while ensuring that we deliver a consistent experience to our readers.

Part of our work has been to define what operational maturity means in the context of The Times. For us, it means requiring teams to take a hard and objective look at their practices and ask, “how is she though?” Does a system have a disaster recovery plan? Does a system have a failover configuration in place in case of a cloud provider outage? Does a team ineffectively translate business metrics into supply metrics and just over-provision and pray? Ultimately, we want teams to recognize that failure is a part of growth, and that consistent recovery from failure comes from having mature practices and playbooks for what to do when things go wrong.

We took inspiration from the Production Maturity Assessment released by Google’s Customer Reliability Engineering team, and using their example, we developed our own Operational Maturity Assessment to provide insight into the operational maturity of The Times’s engineering teams.

But, how is she though?

The Operational Maturity Assessment is an activity that teams do together. A member of the Operations Engineering team facilitates each session, and the activity is designed to foster discussion and self-reflection within the team. (When we designed this assessment, we imagined teams gathering in person to eat pizza and complete it together. But that was before the pandemic. Alas.)

Each session lasts one to two hours, and a member of the team drives the assessment from their screen using a tool we built. Inspired by tax preparation software, the tool breaks six top-level categories down into manageable sections.

As the team moves through the categories — we will get into them in a minute — they answer prompts around topics that are intended to inspire conversation about whether their systems meet the maturity expectations of The Times. We want the assessment to provide a framework for thinking through parts of the system that may need to be addressed and prioritized alongside new feature development work.

Category is… Operational Maturity Model

The assessment goes over our Operational Maturity Model, which lays out the dimensions and scales of operational maturity at The Times along six categories:

Monitoring and Metrics: Does the service have defined service-level indicators, objectives and agreements in place (things like uptime or ability to handle a certain number of requests per second)? Are data about those metrics visualized consistently and in a place that is easy for the team and downstream system teams to find? Do the metrics accurately reflect the experience of users?

Capacity Planning: Do the product and engineering teams work together to forecast business metrics, such as users or sales, and convert those into supply levels like servers, storage or bandwidth? Does the team acquire capacity based on forecasted business and supply metrics, and is utilization of that capacity monitored for efficiency?

Change Management: Does the team conduct releases at a predictable cadence; do they have defined practices for reviewing changes and releases; and do they automate most or all of the work of doing releases via continuous integration and deployment tools?

Emergency Response: Does the team have an organized on-call rotation? Does the team regularly analyze automated alerting to make sure engineers aren’t bombarded with alerts that don’t require human attention? Does the team conduct effective learning reviews after incidents, with metrics, lessons learned and clear steps to prevent issues in future?

Service Provision and Decommission: Does the team use infrastructure-as-code tools? Is the tooling standardized across the team, department and organization?

Reliability: Does the team design and deploy their systems to be highly available, such as running workloads and data storage across multiple data centers or availability zones? Does the team have a documented disaster recovery plan with automated backups and regular testing? Are the team’s services deployed in multiple geographic regions or with multiple cloud providers?

The questions in each section help drive discussion within the team. This discussion aims to educate teams on best practices and inspire new ideas.
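To make the Capacity Planning prompt concrete, here is a minimal sketch of converting a forecasted business metric (daily users) into a supply level (server count). The function name, the parameters and every number in it are hypothetical and chosen only for illustration; they are not taken from any Times system.

```python
import math


def servers_needed(
    forecast_daily_users: int,
    requests_per_user: float,
    peak_to_average: float,
    rps_per_server: float,
    target_utilization: float,
) -> int:
    """Convert a forecasted business metric into a server count.

    All parameters are illustrative assumptions:
    - peak_to_average: how much higher peak traffic is than the daily average
    - target_utilization: fraction of a server's capacity we plan to use,
      leaving the rest as headroom
    """
    seconds_per_day = 86_400
    average_rps = forecast_daily_users * requests_per_user / seconds_per_day
    peak_rps = average_rps * peak_to_average
    usable_rps_per_server = rps_per_server * target_utilization
    # Round up: a fraction of a server still means one more server.
    return math.ceil(peak_rps / usable_rps_per_server)


# Hypothetical forecast: 8.64 million daily users making 10 requests each.
example_servers = servers_needed(
    forecast_daily_users=8_640_000,
    requests_per_user=10,
    peak_to_average=3,
    rps_per_server=200,
    target_utilization=0.5,
)
```

With these made-up inputs, the average load works out to 1,000 requests per second, the peak to 3,000, and each server is planned at 100 usable requests per second, so the forecast calls for 30 servers. The point of the exercise is that the supply number is derived from the business forecast rather than guessed.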

[We’re hiring for our Delivery Engineering team. Come work with us.]

As the team works through each category, they score themselves on a scale of one to five against a maturity rubric. Each section includes a range from “chaotic” to “continuous improvement.” For example, from the Reliability category’s sub-topic, Multiple Region Support:

  1. Chaotic: Service runs in a single region.
  2. Managed: Service runs in a single region but can be spun up in a second region.
  3. Defined: Services run in multiple regions with manual failover.
  4. Measured: Services run in multiple regions with automated failover.
  5. Continuous Improvement: Services run in multiple regions, across clouds with automated failover.
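
One way to picture the rubric is as a small data structure that maps each score to its level name and description. The sketch below is only an illustration and is not the code behind our assessment tool; the names `Maturity`, `MULTI_REGION_RUBRIC` and `describe` are hypothetical, while the level names and descriptions are copied from the example above.

```python
from enum import IntEnum


class Maturity(IntEnum):
    """The five maturity levels, scored one to five."""
    CHAOTIC = 1
    MANAGED = 2
    DEFINED = 3
    MEASURED = 4
    CONTINUOUS_IMPROVEMENT = 5


# Rubric text for the Multiple Region Support sub-topic.
MULTI_REGION_RUBRIC = {
    Maturity.CHAOTIC: "Service runs in a single region.",
    Maturity.MANAGED: "Service runs in a single region but can be spun up in a second region.",
    Maturity.DEFINED: "Services run in multiple regions with manual failover.",
    Maturity.MEASURED: "Services run in multiple regions with automated failover.",
    Maturity.CONTINUOUS_IMPROVEMENT: "Services run in multiple regions, across clouds with automated failover.",
}


def describe(score: int) -> str:
    """Return the rubric line for a score; raises ValueError outside 1..5."""
    level = Maturity(score)
    label = level.name.replace("_", " ").title()
    return f"{level.value}. {label}: {MULTI_REGION_RUBRIC[level]}"
```

Because the levels are ordered integers, a team's scores can be averaged, compared and plotted without any extra conversion.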

After completing the assessment, the team is presented with a report that outlines areas of strength and areas for improvement, compared to all other engineering teams at The Times.
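
A report like this could be computed by comparing each of a team's scores with the average of the other teams' scores in the same topic. This is a minimal sketch under that assumption, not a description of how our tool works. The team names and scores are made up; only the topic names come from the assessment.

```python
from statistics import mean

# Hypothetical scores: team name -> topic -> score (1 to 5).
sample_scores = {
    "alpha": {"Change Process Automation": 5, "Disaster Recovery": 1},
    "beta": {"Change Process Automation": 3, "Disaster Recovery": 2},
    "gamma": {"Change Process Automation": 4, "Disaster Recovery": 4},
}


def build_report(team: str, all_scores: dict) -> dict:
    """Return, per topic, the team's score minus the other teams' average."""
    gaps = {}
    for topic, score in all_scores[team].items():
        others = [
            scores[topic]
            for name, scores in all_scores.items()
            if name != team and topic in scores
        ]
        if others:  # skip topics no other team has scored yet
            gaps[topic] = score - mean(others)
    return gaps


alpha_report = build_report("alpha", sample_scores)
strengths = [topic for topic, gap in alpha_report.items() if gap > 0]
improvements = [topic for topic, gap in alpha_report.items() if gap < 0]
```

In this made-up data, the "alpha" team is ahead of its peers in Change Process Automation and behind them in Disaster Recovery, which is the kind of signal the report is meant to surface.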

Not only does this process encourage engineering teams to discuss their strengths and weaknesses in a structured way, it gives them real data about technical and architectural debt that needs to be prioritized alongside new feature work. This also helps teams communicate their reliability improvement needs to company leadership.

Everybody say “trends!”

Our team, Operations Engineering, collects these team-level scores and collates them on a dashboard that allows us to see trends across the engineering organization. In aggregate, trends help us, and the leaders in engineering and product, understand where the organization as a whole needs to focus time and resources to drive improvement.

In the visualization below, each column belongs to a category, while each row belongs to a team. The columns show us cross-team trends within each metric. By using a color range of light orange to dark orange, we can see how well teams perform in a particular metric.

A grid heatmap where rows represent a team’s scores across all parts of the assessment and columns represent cross-team scores in a single topic. Some columns show higher overall scores like Change Process Automation, where others show lower overall scores like SLO Definition and Measurement and Disaster Recovery.
Sample of scores from teams that completed the assessment. Darker colors indicate lower scores.

We have relatively high scores in the Change Management (third column group, from left) and Service Provision and Decommission (fifth column group, from left) sections, which are a direct result of previous organizational initiatives for improvement.

Other categories, such as Monitoring and Metrics (first column group, from left), indicate an opportunity for improvement. The column for SLO Definition and Measurement, within the Monitoring and Metrics group, shows weaker scores overall, which helped us clarify a gap and allowed us to expand our Observability team. In the Reliability column group (far right), the Disaster Recovery column indicates a weakness in defining disaster recovery plans. We are beginning to push teams to build their disaster recovery plans to improve our overall reliability.
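
Reading the heatmap by column amounts to averaging each topic's scores across teams and looking for the lowest averages. The sketch below shows that aggregation with made-up team names and scores; the function names are hypothetical and the dashboard itself is not built this way as far as this article states.

```python
from collections import defaultdict

# Hypothetical scores: team name -> topic -> score (1 to 5).
heatmap_scores = {
    "alpha": {
        "Change Process Automation": 5,
        "SLO Definition and Measurement": 2,
        "Disaster Recovery": 1,
    },
    "beta": {
        "Change Process Automation": 4,
        "SLO Definition and Measurement": 1,
        "Disaster Recovery": 2,
    },
    "gamma": {
        "Change Process Automation": 4,
        "SLO Definition and Measurement": 2,
        "Disaster Recovery": 3,
    },
}


def topic_averages(all_scores: dict) -> dict:
    """Average each topic (heatmap column) across all teams (rows)."""
    by_topic = defaultdict(list)
    for team_scores in all_scores.values():
        for topic, score in team_scores.items():
            by_topic[topic].append(score)
    return {topic: sum(vals) / len(vals) for topic, vals in by_topic.items()}


def weakest_topics(all_scores: dict, n: int = 2) -> list:
    """Return the n topics with the lowest cross-team average."""
    averages = topic_averages(all_scores)
    return sorted(averages, key=averages.get)[:n]


focus_areas = weakest_topics(heatmap_scores)
```

With this sample data, the two lowest columns are SLO Definition and Measurement and Disaster Recovery, mirroring the pattern described above.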

You betta werk!

As The Times continues to grow its subscription base, more and more readers come to our platforms for news. This means it is increasingly important for our teams to be ready for breaking news and system failures. For the last three American election cycles, we convened a cross-engineering readiness team to manage the efforts to prepare our systems for those events, which involved significant heroics on the part of those who joined.

By defining system maturity and reliability standards, and then encouraging consistent operational thinking for all engineering teams at The Times, our hope is that we eliminate the need for heroic efforts in the future. Our ultimate goal is to make high-traffic news stories non-events, where our systems are mature enough to handle the load without the need for extra attention beyond marveling at the traffic charts.

While the assessment today is a collaborative and fairly detailed exercise, we want to automate as much as possible. Our team is looking to augment our existing capabilities, and add more monitoring, metrics and change process automation (think deploy-on-green, canary deployments and automated rollbacks). Eventually we want to incorporate more mature practices like Chaos Engineering. This stuff won’t replace collaborating with teams, but it will augment the process with better data.

If you are interested in solving any of these problems with us, please check out our roles on the Delivery Engineering team at https://nytco.com/careers. We’d love to speak with you.

Tim Burke is a Senior Software Engineer and Site Reliability Engineer on our Delivery Engineering, Operations Engineering team. He is a RuPaul’s Drag Race-obsessed superfan and a frequent contributor to The Times’s #just-yelling channel in Slack.

Special thanks to Vinessa Wan, Shawn Bower and the Delivery Engineering team.

We Don’t Get Bitter, We Get Better was originally published in NYT Open on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source: New York Times
