That’s exactly what we hoped for.
By Jeremy Gayed, Said Ketchman, Oleksii Khliupin, Ronny Wang and Sergey Zheznyakovskiy
If you’ve read an article on The New York Times website any time in the past eight years, you’ve probably seen a little message telling you how many free articles you have left to read. You may also have seen our paywall: that big box that pops up and asks you to register or subscribe once you’ve hit your monthly article limit. This is our Meter Service and it’s an important part of our subscription-first business model.
The Meter Service has been around since we first launched our paywall in 2011 and it determines if our readers can access the hundreds of articles we publish every day. It also handles over 30 million user requests daily and it is the gatekeeper for how we acquire new subscribers. In early 2018, we decided it needed to be rewritten. To up the stakes, we had to ensure there was zero interruption to the business while also writing it to scale for future needs.
No big deal, right?
To be clear, we don’t take rewrites lightly. If they can be avoided, then it’s probably best to avoid them. We combed through old documentation for the Meter, talked to people who had been involved throughout the years and determined what improvements it needed to continue to meet our business objectives. What we had on our hands was an eight-year-old application that had been in maintenance mode for most of its life. The original creators were long gone, and it was built at a time when the industry doubted whether a subscription-based digital news service would even work.
When we cleared the cobwebs, what we found was a complex NGINX configuration that fronted a parsing engine in C and an architecture that couldn’t autoscale, all deployed to Amazon Web Services (AWS). The only saving grace was that the XML-based metering rules were versatile enough to meet our current business logic needs. It was obvious that we were years behind the type of architecture we develop at The Times today.
To ensure that the Meter could scale, we needed business logic consolidated within the service itself instead of spread out across our apps and website; our Android app had its own metering service separate from the main service. Additionally, we needed the ability to quickly test and implement new metering rules and logic, which the XML configuration and C engine made extremely difficult: changes were time-consuming to make, test and deploy, and often prone to errors. Most importantly, we needed confidence that the system was working as expected.
The changes we wanted to make
A lot has changed in the years since we first launched the paywall, especially with how we architect our applications. We knew we needed to update the Meter with more modern technology, like a properly auto-scaled environment that could provide better latency for our application. Previously we had to pre-warm AWS load balancers ahead of anticipated traffic spikes, such as the elections. We wanted to avoid that by moving away from our over-provisioned cluster in AWS, to a newer auto-scaled Kubernetes infrastructure in Google Cloud Platform (GCP).
We don’t write applications in C anymore, so we needed a new language that wouldn’t mean compromising on speed and stability, yet was more widely used at The Times and closer to the core technical competencies of our engineering teams. We decided on Go.
Logging and metrics are a critical part of our current applications, and the Meter needed to be brought up to date. To be confident in its performance, we determined that having proper Service Level Objectives, or SLOs, would help in this effort. When something goes wrong, we need to see why the problem is happening and if action needs to be taken — PagerDuty works well for that.
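To make the idea of an SLO concrete, here is a minimal Go sketch of an error-budget check. The `SLO` type, the 99.9 percent target and the alert threshold are all assumptions chosen for illustration; the actual objectives set for the Meter are not stated here.

```go
package main

import "fmt"

// SLO is a hypothetical availability objective. The target used below is an
// illustrative assumption, not a figure from the Meter Service.
type SLO struct {
	Target float64 // fraction of requests that must succeed, e.g. 0.999
}

// BudgetRemaining returns the unspent share of the error budget for a window
// of requests: 1 means untouched, 0 means fully spent, and a negative value
// means the objective has been missed.
func (s SLO) BudgetRemaining(total, failed int) float64 {
	allowed := (1 - s.Target) * float64(total) // failures the SLO tolerates
	if allowed <= 0 {
		if failed == 0 {
			return 1
		}
		return -1
	}
	return 1 - float64(failed)/allowed
}

func main() {
	slo := SLO{Target: 0.999}
	remaining := slo.BudgetRemaining(1000000, 250)
	fmt.Printf("error budget remaining: %.1f%%\n", remaining*100)
	if remaining < 0.25 { // hypothetical alerting threshold
		fmt.Println("budget nearly spent: alert the on-call engineer")
	}
}
```

A check like this could feed an alerting tool so that a page goes out only when the objective is actually at risk.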
It’s all about the preparation
It’s tempting to jump right in once you’ve architected a solution, but if you’ve ever done that before, then you know that it can lead to a lot of problems down the road. Getting different perspectives to vet our solution and making sure we had a plan for long-term stability and support were both essential for a smooth launch. Here are a few of the areas we made sure to cover.
1. Request for Comments process
The Times engineering group believes strongly in the Request for Comments (RFC) process because it helps bring visibility to a major architectural change and engages colleagues in decision making. An RFC is not very complicated; it is simply a technical document that proposes a detailed solution to a specific problem. We shared our RFC with all of our colleagues in the Technology department and asked for their feedback.
Before we started coding, we spent several weeks preparing a couple of RFCs that outlined our plan to rewrite the Meter Service. This included documenting the current system and its dependencies, analyzing multiple proofs of concept and creating clear architectural diagrams of our new solution. In the end, we were able to tighten up our plan and identify possible edge cases that we hadn’t thought through. The time it took to solicit feedback was worth it.
2. Decide on a testing plan
The beauty of rewriting a service is that you can make sure it’s done right from the start. We wanted several testing layers that we could automate (because who enjoys manual testing?), with coverage from unit tests, contract tests, load tests and end-to-end functional tests.
Remember, this was a high-stakes replatform and we couldn’t mess it up. So, we needed a way to definitively demonstrate that the new service wouldn’t break existing integrations or business behavior. With our testing strategy vetted by our RFC process, we could prove feature parity with the old meter and be confident in our comprehensive test suite.
3. Making sure the lights are on (i.e. monitoring)
The lack of visibility into the old Meter’s performance and stability was a problem for both technical and business reasons: the Meter is a key way we incentivize subscription purchases, yet we were never really sure how well it was working.
Since we had little to work with, we decided we wanted to better understand how the existing service performed for our users. We put in place robust Real User Monitoring (RUM) to capture response times, error and success rates. This helped us determine not only what to watch for, but what benchmarks we should set for acceptable behavior.
Before rolling out our new service, we put in place dashboards in Chartio (for client-side data) and Stackdriver (for service-level data) that allowed us to observe any changes to existing metrics. With this in place, we had yet another method for ensuring we could prove our rollout wasn’t going to break anything.
Launch the service, make sure no one notices
The prep work we did allowed us to move extremely quickly to rebuild the service, spin up the new infrastructure and create a solid deployment pipeline.
The next step was to launch, but we wanted to make sure we did it right to avoid a high-stress and high-stakes situation. We couldn’t afford to have our launch end up like Indiana Jones trying to snatch the idol treasure in Raiders of the Lost Ark, so we came up with a tactical approach.
Our challenge was to guarantee that the new Meter Service performed on par with or better than the old service, and to ensure that the responses from the two services were identical for every user. To achieve this, we ran both services simultaneously, though the new service didn’t affect users at first. We called this our “dark rollout.”
Setting up a dark rollout
The graphic above shows the call to the legacy service and the additional silent call we added to the web client that went to the new service. The call was non-blocking and had no impact on the actual user experience. The responses received from each service were tracked and compared, and we used this opportunity to load-test the new service to see how well it performed during real news-related traffic spikes.
The benefit of this approach was that the legacy service was still functioning throughout, which meant we had the freedom to modify the new service without worrying about impacting users. We could even take it down if we needed to make configuration changes to the infrastructure, such as auto-scaling policies or instance sizes. We let this run for a couple of weeks until we ironed out the last bugs and felt confident that we could proceed with a phased rollout.
Validating the new service with a single client
We accomplished what we needed from the dark rollout and felt ready to start relying on the response from the new service. Before we fully launched it, however, we did an initial test on The Times’s website to be extra safe. Within the web client integration itself, we routed one percent of traffic to the new service and the remaining 99 percent to the legacy service. Additionally, we added a back-end validation layer, which allowed us to compare the new and legacy service results to make sure they completely matched.
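One common way to send a fixed share of traffic to a new service is to hash a stable identifier into 100 buckets, as in the Go sketch below. This is an assumption for illustration: how our web client actually picked its one percent is not described here, and `UseNewService` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// UseNewService reports whether userID falls inside the rollout percentage.
// Hashing keeps the choice stable, so a given user always sees the same
// service, and raising percent only ever adds users to the new service.
func UseNewService(userID string, percent int) bool {
	h := fnv.New32a()
	h.Write([]byte(userID)) // Write on a hash.Hash never returns an error
	return int(h.Sum32()%100) < percent
}

func main() {
	inNew := 0
	for i := 0; i < 100000; i++ {
		if UseNewService(fmt.Sprintf("user-%d", i), 1) {
			inNew++
		}
	}
	fmt.Printf("%d of 100000 synthetic users routed to the new service\n", inNew)
}
```

With this scheme, ramping from one percent to 50 percent is a one-number change, and so is ramping back down.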
Validating all clients
Once we knew we were good on our website, we moved on to test all other applications. At this point we knew the service worked, but thought that a gradual rollout was a smart way to have a smooth transition. There are a lot of ways to accomplish this, but we wanted to avoid having to update code in each integration. We opted to do it at the DNS level, pointing one percent of all traffic to the new infrastructure and 99 percent to the old. Then we watched and waited.
Flip a coin: 50 percent rollout
A huge milestone for us was the ramp up to 50 percent of all traffic to The Times’s web offerings using our new service. Things were humming along pretty well until we discovered a problem.
Our AMP platform wasn’t handling requests properly and wasn’t metering users at all. That was great for users, but terrible for business. In short, AMP requires custom headers to be included in the response our service provides, but we weren’t providing them.
After spending a day investigating, we realized these headers weren’t provided in the original legacy C code, but rather in the NGINX layer. We missed something big, but because we could easily throttle traffic to the new service, we had a disaster plan ready. We scaled down traffic to the new service and quickly added in the necessary headers before ramping back up to 50 percent. We triple-checked everything, then watched to see how it fared.
Ta-da! Everyone gets the new service
The glorious day finally arrived when we felt confident enough to turn the dial and serve all traffic from our new Meter Service. The best part was that no one even knew we had launched, though we did inform senior leadership and stakeholders.
Part of launching a replatform is retiring the old systems, so we thanked the old system for its service and officially said goodbye to the old AWS stack. Overall, our entire rollout took about a month. While that may seem long, it ensured we were completely confident that our new solution worked. We avoided having a chaotic launch and actually felt a little awkward with how little drama there was.
It’s now been a year since our launch and, because of our approach, we have had a lot of time to improve the system rather than chase production issues. We were even able to scale during the record-breaking United States midterm elections, handling over 40 times our normal traffic.
This project allowed us to hone our playbook for how to handle launches:
- Peer review your plan before building.
- Ensure proper test coverage for key moving pieces.
- Ensure you have metrics and observability in place to watch for issues.
- Consider a very gradual rollout (even a dark rollout) instead of a full cut-over.
- Keep an eye out for anomalies during and after launch, and have an easy way to roll back if any issues come up.
We’re trying hard to make great software here at The Times and are always looking to improve. Hopefully we’ve inspired you to do the same!
Jeremy Gayed is a Lead Software Engineer focused on evolving The New York Times’s web architecture.
Said Ketchman is a Senior Engineering Manager overseeing the meter and marketing technology teams at The New York Times.
Ronny Wang is a Software Engineer on the meter team.
Sergey Zheznyakovskiy is a Lead Software Engineer on the meter team.
Oleksii Khliupin is a Senior Software Engineer formerly on the meter team.
We Re-Launched The New York Times Paywall and No One Noticed was originally published in Times Open on Medium.
Source: New York Times