Migrating edge network providers

Unbeknownst to our users, we recently migrated edge network providers.
This involved some particularly interesting problems that we needed to
solve in order to migrate without impacting the availability or
integrity of our services.

Before we get into how we made the move, let’s look at what an edge
network actually does.

An edge network allows us to serve content to users from a physically
nearby location. This lets us deliver content quickly and securely by
avoiding sending every request to our origin infrastructure, which may
be physically distant from users.

On the security front, using an edge provider allows us to perform
security mitigations without tying up our origin infrastructure
resources. This becomes quite important when we start talking about
Distributed Denial of Service (DDoS for short) attacks, which aim to
saturate your network and consume all of your compute resources, making
it difficult for legitimate users to visit your site. By offloading the
defense against malicious traffic and the mitigation work to a set of
purpose-built servers distributed across the globe, you free up your
origin resources to do what they need to do and serve your users.

DDoS + WAF

Malicious users are a very real threat and something we deal with on a
daily basis. The majority of these attacks aren't volumetric; however,
they can still impact other users if they manage to generate enough
requests to slow down a particular part of our service ecosystem.

Depending on the type of malicious traffic, we have two options: a Web
Application Firewall (WAF) and DDoS scrubbing.

The WAF is used for most of our mitigations. It is a series of rule sets
that have been developed over the years based on attacks we've seen
against our services. We "fingerprint" a large sample of requests and
extract the common traits, then either block the traffic, tarpit the
request, or issue a challenge that requires human interaction to
proceed.
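
As a rough illustration of that fingerprint-then-act idea (the traits,
rules and actions below are invented purely for illustration and are not
our real rules), the matching step can be sketched like this:

# Purely illustrative sketch of fingerprint-based mitigation: a request
# whose traits match a known fingerprint is blocked, tarpitted or
# challenged; everything else is allowed through.
Fingerprint = Struct.new(:traits, :action)

RULES = [
  Fingerprint.new({ user_agent: /curl/i, path: %r{\A/login} }, :challenge),
  Fingerprint.new({ header_order: 'host,accept,user-agent' },  :tarpit)
].freeze

def mitigation_for(request)
  rule = RULES.find do |fingerprint|
    fingerprint.traits.all? { |trait, pattern| pattern === request[trait] }
  end
  rule ? rule.action : :allow
end

mitigation_for({ user_agent: 'curl/8.0', path: '/login' })  # => :challenge
mitigation_for({ user_agent: 'Mozilla/5.0', path: '/' })    # => :allow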

DDoS scrubbing comes into play when we have a highly distributed attack
or are seeing high volumes of network traffic. The goal of this
mitigation is to filter out the malicious requests (much like the WAF);
however, it is usually done far more aggressively and involves
inspecting aspects beyond just HTTP.

Prior to the move, these were two separate systems, with requests
passing through the DDoS scrubbing provider and then the WAF before
reaching our origin.

This setup wasn't perfect and led to a few issues.

  • Debugging was very difficult. To get to the bottom of any request
    issues, we needed to use both systems to piece together the full
    picture. While both systems had correlation IDs that we could map
    against each other, it was easy to get confused about which part of
    the request/response you were looking at in either system.
  • Coordinating changes was hard. As we added features to either our
    DDoS defense or WAF, we needed to do extra work to ensure that
    rolling out changes in one would continue to work with the other.
  • API differences. We are big users and advocates of Infrastructure
    as Code; however, only some of our service providers offered this
    facility, and even then only for limited portions of their services.
    This meant we either had to use the UI with manual reviews or store
    only part of the configuration in code, which added confusion about
    what lived where.
  • Getting blocked in one system could be misunderstood by the other
    system. If a user managed to trigger one of our WAF rules, the DDoS
    system, which had some basic origin health-check capabilities, could
    read the response as the origin being under load and start throwing
    confusing errors. This got in the way of finding the real issue, as
    you would see errors from both systems instead of just one.

So, we set out to combine the two systems into a single port of call for
our traffic mitigation needs.

DNS

We maintain both internal and external DNS services. For our internal
DNS, we use AWS Route53 as it is already well integrated with our
infrastructure. Externally, however, we needed something that would
handle all the standard functionality, plus cloak our origin and prevent
recursive lookups from discovering it.

Something else we wanted to improve was the auditability of our DNS zone
changes. Our existing DNS provider didn't lend itself very well to
managing records as code. This meant changes had to be staged in a UI
and then posted to Slack channels for review by other engineers before
being committed. Managing our DNS in code would help us level up our
security practices because it would keep DNS changes easily searchable
and help mitigate vectors like dangling DNS vulnerabilities.
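
As a purely illustrative sketch (this is not our actual tooling), "DNS
as code" boils down to the zone living in version control as reviewable
data that a sync step applies to the provider:

# Hypothetical sketch: zone records as plain data in version control, so
# every change goes through a pull request and stays searchable. The sync
# script that applies this to the provider's API is not shown.
ZONE = {
  'themeforest.net' => [
    { type: 'CNAME', name: 'www', rdata: 'origin.hostname', ttl: 300 },
    { type: 'TXT',   name: '@',   rdata: 'v=spf1 include:spf.mandrillapp.com -all', ttl: 3600 },
    { type: 'MX',    name: '@',   rdata: 'aspmx.l.google.com', preference: 1, ttl: 3600 }
  ]
}.freeze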

Preparing for the move

One of our biggest concerns with migrating these services was parity
between the new and old providers. Discrepancies between the two systems
could cause subtle problems that, if left unmonitored, would grow into
bigger issues for us.

We decided that we would address this the same way we prevent
regressions in our applications: we would build out a test suite. Our
engineering teams are very autonomous, which meant this test suite
needed to be easy for the majority of the engineering team to understand
and use, since they could potentially be making changes and would need
to verify behaviour.

After some discussions, we landed on RSpec. RSpec is already a
well-understood framework in our test suites and the majority of our
teams use it daily. Even though RSpec would get us most of the way, we
would still need to extend it to add support for the expected HTTP
interactions and conditions. To do this, we wrote HttpSpec, our HTTP
RSpec library that performs the underlying HTTP request/response
lifecycle and provides a bunch of custom matchers for methods, statuses,
caching, internal routing and protocol negotiation. Here is an example
of something you might see in our test suite:

# HTTP/S
RSpec.describe 'themeforest.net' do
  it { is_expected.to redirect_http_to_https }
  it { is_expected.to support_tlsv1_2 }
end

# Caching
RSpec.describe 'themeforest.net/category/all' do
  it { is_expected.to be_browser_cachable }
  it { is_expected.to be_shared_cachable }
  it { is_expected.to serve_stale_while_revalidating(ttl: 60) }
end
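
As a rough sketch only (this is not HttpSpec's actual implementation), a
matcher like redirect_http_to_https can be built on RSpec's custom
matcher DSL and Net::HTTP along these lines:

require 'net/http'
require 'rspec/expectations'

# Illustrative sketch only: a redirect_http_to_https matcher built with
# RSpec's custom matcher DSL. It issues a plain-HTTP request and checks
# that the response is a redirect pointing at an https:// URL.
RSpec::Matchers.define :redirect_http_to_https do
  match do |hostname|
    response = Net::HTTP.get_response(URI("http://#{hostname}/"))
    response.is_a?(Net::HTTPRedirection) &&
      response['location'].to_s.start_with?('https://')
  end

  failure_message do |hostname|
    "expected http://#{hostname}/ to redirect to its HTTPS equivalent"
  end
end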

This solved the issue for most of the functionality we were looking to
port; however, we still didn't have a solution for DNS. We started
putting together a proof of concept that relied on parsing dig
responses, but a short while later decided that wouldn't scale to our
configuration due to the number of variations that could be encountered.
This prompted us to go in search of a more maintainable tool. Luckily
for us, Spotify had already solved this problem and open sourced
rspec-dns. rspec-dns was a great option for us since it could be
integrated into our existing RSpec test suites and gave us the same
benefits we wanted in our edge test suite. This is what our DNS tests
looked like:

RSpec.describe 'themeforest.net' do
  # CNAME
  it { is_expected.to have_dns.with_type('CNAME').with_rdata('origin.hostname') }

  # TXT
  it { is_expected.to have_dns.with_type('TXT').with_data('v=spf1 include:spf.mandrillapp.com -all') }

  # MX
  it { is_expected.to have_dns.with_type('MX').with_exchange('aspmx.l.google.com').with_preference(1) }
end

Now that we had a way of confirming behaviour on both systems, we were
ready to migrate!

Making the move

The second big issue we hit was that the two providers didn't use the
same terminology. This meant that a "zone" in provider A wasn't
necessarily the same thing in provider B.

Remedying this wasn't a straightforward process and required a fair
amount of documentation diving and experimentation with both providers.
In the end, we built a CLI tool that took the API responses from our old
provider and mapped them to what our new provider expected in order to
manage the equivalent resources. This greatly reduced the chance of
human error when migrating these resources and ensured that we could
reliably create and destroy resources over and over again. An upside of
taking this automated approach is that we could couple resource creation
with spec creation. For instance, if the CLI tooling found a DNS record
in provider A, it would also update our specs to include an assertion
based on what was going to be created in provider B (yay, free test
coverage!).
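
In broad strokes, that mapping-plus-spec-generation step can be pictured
like the sketch below; every name in it is a placeholder rather than our
real CLI or provider clients:

# Hypothetical sketch of the migration CLI's core loop: read records from
# the old provider, translate them into the new provider's terminology and
# create them there, then append an rspec-dns assertion for each record so
# the spec suite grows alongside the migration. OldProvider, NewProvider
# and the field names are placeholders, not real client libraries.
old_records = OldProvider.list_records(zone: 'themeforest.net')

old_records.each do |record|
  NewProvider.create_record(
    zone:    'themeforest.net',
    type:    record.fetch(:type),
    name:    record.fetch(:name),
    content: record.fetch(:value), # provider A's "value" maps to provider B's "content"
    ttl:     record.fetch(:ttl)
  )

  File.open('spec/dns/themeforest_spec.rb', 'a') do |file|
    file.puts "  it { is_expected.to have_dns" \
              ".with_type('#{record[:type]}')" \
              ".with_rdata('#{record[:value]}') }"
  end
end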

Our approach to cutting over sites followed a standard Test Driven
Development (TDD) process:

  • Create the expected tests
  • Run our test suite and see them all fail
  • Port over the functionality to the new provider
  • Re-run the test suite and see what is missing
  • Rinse and repeat

As an additional safety measure, we configured our edge and DNS test
suites to run hourly (outside of regular Pull Request triggered builds)
and trigger notifications for any failures. This ensured we were
constantly getting feedback on a quickly changing system and would know
straight away if we broke anything.

To keep the blast radius of changes as small as possible while we gained
confidence in the migration process, we migrated the systems in order of
traffic and their potential for customer impact. By taking this
approach, we were able to give stakeholders confidence that we could
bring over larger systems without impacting users.

Once we were happy the site was working as expected, we would release it
to our staff using split-horizon DNS to gain some confidence that
nothing had been missed. If a regression was found, we'd go back through
the TDD process until we were completely confident in our changes.

After we were happy with the testing, we'd schedule some time to perform
the public cutover. On cutover day, the migration team would jump into a
Hangout and start stepping through the runbook, monitoring for any
abnormal changes.

A caveat to note about DNS NS record TTLs: despite taking precautions
such as lowering the NS TTLs weeks beforehand, the TTLs for NS records
are largely ignored by most resolver implementations. This means that
while you may cut over at 8am on a Monday, the change can see a long
tail until full propagation is achieved; in our case, up to 3 or 4 days
in some regions. For this reason, we introduced a new 24×7 on-call
roster that would help the system owners mitigate this issue should we
need to roll back the cutover.
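
One way to watch that long tail is to ask a handful of public resolvers
which NS set they are still returning. A minimal sketch using Ruby's
standard Resolv library (the resolver IPs and zone are just examples):

require 'resolv'

# Query a few public resolvers for the zone's NS records to see which of
# them have picked up the new delegation. Resolver IPs are examples only.
RESOLVERS = %w[8.8.8.8 1.1.1.1 9.9.9.9].freeze

RESOLVERS.each do |ip|
  Resolv::DNS.open(nameserver: [ip]) do |dns|
    records = dns.getresources('themeforest.net', Resolv::DNS::Resource::IN::NS)
    puts "#{ip}: #{records.map { |ns| ns.name.to_s }.join(', ')}"
  end
end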

Final thoughts

Embarking on a migration project for your edge network provider is no
small feat and it definitely isn’t without risks. However, we are
extremely pleased thus far with the improvements and added
functionality that we have gained from the move.

In the future, we will be looking to integrate our edge provider more
closely with our origin infrastructure. The intention behind this is to
automate away some of the manual intervention we currently perform when
applying traffic mitigation. In the long term, this will help us build a
safer and more resilient Envato ecosystem.

Source: Envato