Speeding things up with Redshift

Recently we’ve started to experiment with using Redshift, Amazon’s
new data warehousing service. More specifically, we’re using it to
speed up and expand our ad hoc data analysis.

The Challenge

bitly sees billions of clicks and shortens each month. Often we have
various questions about the data generated from this activity. Sometimes
these questions are driven by business needs (how much traffic do we see from
a potential enterprise customer?), sometimes they are more technically driven
(how much traffic will a new sub-system need to deal with?), and sometimes we
like to just have fun (what are the top trashy celeb stories this week?).

Unfortunately, when working with that volume of data it can be pretty
difficult to do much of anything quickly. Pre-Redshift, all of these questions
were answered by writing map-reduce jobs to be run on our Hadoop cluster or on
Amazon’s EMR.

Whenever we wanted to answer a question with our data, the process would look
something like this:

  1. Write a map-reduce job in Python
  2. Run it on some local test data
  3. Fix bugs
  4. Run it on the Hadoop cluster
  5. Wait 20-30 minutes for results
  6. Get an error back from Hadoop
  7. Dig through the logs to find the error
  8. GOTO 3

This is clearly not ideal when all you want to do is get a simple count.

For a lot of the work we do, Hadoop + Python make for an awesome combination,
but for these ad hoc aggregation queries they’re very blunt instruments. Both
are general-purpose tools that are super flexible, but slow and difficult to
use for this specific case.

Redshift, on the other hand, is specifically built and optimized for doing
aggregation queries over large sets of data. When we want to answer a
question with Redshift, we just write a SQL query and get an answer within a
few minutes—if not seconds.
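
To give a flavor of what that looks like, here’s a rough sketch of the kind of
query session involved. The cluster endpoint, table, and column names are made
up for illustration; since Redshift speaks the PostgreSQL wire protocol, a
stock driver like psycopg2 does the job:

    import psycopg2

    # Endpoint, credentials, and schema below are illustrative placeholders.
    conn = psycopg2.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="analyst",
        password="...",
    )

    with conn.cursor() as cur:
        # e.g. "how much traffic do we see from a potential enterprise customer?"
        cur.execute("""
            SELECT date_trunc('day', click_ts) AS day, count(*) AS clicks
            FROM clicks
            WHERE referrer_domain = %s
            GROUP BY 1
            ORDER BY 1
        """, ("example.com",))
        for day, clicks in cur.fetchall():
            print(day, clicks)

    conn.close()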

Overall, our experience with Redshift has been a positive one but we have run
into some gotchas that we’ll get into below.

The Good News

User Experience

From a user perspective, we’re really happy with Redshift. Any of our
developers or data scientists can write a SQL query and have an answer to
their question in less than five minutes. Moving from our old Hadoop-based
workflow to an interactive console session with Redshift is a major
improvement.

Additionally, since much of the user-facing side of Redshift is based on
PostgreSQL, there is a large ecosystem of mature, well-documented tools and
libraries for us to take advantage of.

Finally, while it can be a bit slow at times, we’ve been very impressed with
the web management console Amazon provides with Redshift. For a 1.0 product,
the console is comprehensive and offers much more information than we expected
it to.

Performance

For our current use case of ad hoc research queries, Redshift’s performance is
adequate. Most queries return a response in less than five minutes and we
rarely have many users executing queries concurrently.

That being said, we have done some experimentation with competing products
(e.g. Vertica) and have seen better performance out of those tools. This is
especially true for more complex queries that benefit from
projections/secondary indexes and situations where the cluster’s resources are
under contention.

Documentation

Just like the rest of AWS, Amazon provides reasonably comprehensive and
thorough documentation. For everything that is directly exposed to us
as users (e.g. loading operations, configuration parameters, etc.), we are very
happy with the documentation. The only places where we felt we wanted more
information were those where Amazon makes things “just work”. Most
significantly we would like to see more details about what exactly happens
when a node in the cluster fails and how the cluster is expected to behave
in that state.

Cost

We wouldn’t go so far as to call Redshift cheap, but compared to many
competitors it is pretty cost effective. The biggest gotcha here is that
while the simple model for scaling Redshift clusters and tuning performance
within a cluster is nice as a user, it does mean that you have a bit of a
one-size-fits-all situation.

In our case we are computationally and I/O constrained so we’re paying for a
bunch of storage capacity and memory that we don’t use. At our current scale,
things work out okay but as we continue to grow it may make sense to take
advantage of something else that is more flexible in terms of both hardware
and tuning.

The Bad News

Data Loading

We had to spend a lot of time getting our data into Redshift. This is
partially our fault since our dataset is not the cleanest in the world,
but overall this is where we felt the most pain from an immature tool chain.

The majority of bitly’s at-rest data is stored in line-oriented (i.e. one JSON
blob per line) JSON files. The primary way to load data into Redshift is to
use the COPY command to point the cluster at a pile of CSV or TSV
files stored on S3. Clearly we needed to build out some kind of tool chain to
transform our JSON logs into flat files.
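
For reference, the load side of things ends up looking roughly like the
following. The bucket, table, and credential values here are placeholders
rather than anything from our actual setup:

    import psycopg2

    # Bucket, table, and credential values below are placeholders.
    conn = psycopg2.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="analyst", password="...",
    )
    with conn.cursor() as cur:
        # Point COPY at a prefix of flattened TSV files sitting in S3.
        cur.execute("""
            COPY clicks
            FROM 's3://example-bucket/flattened/2013/05/'
            CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
            DELIMITER '\\t'
        """)
    conn.commit()
    conn.close()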

Initially, we just wrote a simple Python script that would do the
transformation. Unfortunately, we quickly discovered that this simple
approach would be too slow. We estimated that it would have taken a month to
process and load all the data we wanted in Redshift.
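
The script itself was nothing fancy. Something along these lines captures the
idea (call it flatten_logs.py); the field names are illustrative rather than
our actual schema:

    import json
    import sys

    # Column order for the flattened output; these field names are
    # illustrative, not bitly's actual schema.
    FIELDS = ["ts", "user", "url", "referrer", "country"]

    def flatten(line):
        """Turn one line-oriented JSON blob into a tab-separated row (or None)."""
        try:
            record = json.loads(line)
        except ValueError:
            return None
        values = [str(record.get(field, "")) for field in FIELDS]
        # Stray tabs or newlines inside a value would corrupt the TSV layout.
        values = [v.replace("\t", " ").replace("\n", " ") for v in values]
        return "\t".join(values)

    if __name__ == "__main__":
        for line in sys.stdin:
            row = flatten(line)
            if row is not None:
                print(row)

Reading stdin and writing stdout also means a script like this drops straight
into Hadoop Streaming.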

Next, we realized we already had a tool for highly parallelized, distributed
text processing: Hadoop. Accordingly, we re-worked our quick Python script
into a Hadoop job to transform our logs in one big batch. Since we already
keep a copy of our raw logs in S3, EMR proved to be a great tool for this.
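
Kicking off that kind of streaming job on EMR is only a few lines with boto.
A rough sketch, assuming the mapper from the previous snippet has been
uploaded to S3 as flatten_logs.py, with bucket names and instance counts that
are purely illustrative:

    from boto.emr import connect_to_region
    from boto.emr.step import StreamingStep

    # S3 locations below are illustrative: the raw logs, the mapper script,
    # and the output prefix that COPY will later point at.
    step = StreamingStep(
        name="flatten json logs",
        mapper="s3://example-bucket/scripts/flatten_logs.py",
        reducer=None,  # a straight per-line transform needs no reduce logic
        input="s3://example-bucket/raw-logs/2013/05/",
        output="s3://example-bucket/flattened/2013/05/",
    )

    conn = connect_to_region("us-east-1")
    jobflow_id = conn.run_jobflow(
        name="redshift data prep",
        log_uri="s3://example-bucket/emr-logs/",
        steps=[step],
        num_instances=10,
        master_instance_type="m1.large",
        slave_instance_type="m1.large",
    )
    print(jobflow_id)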

Overall this process worked well but we did still run into a few gotchas
loading the flattened data into Redshift:

  • Redshift only supports a subset of UTF-8: characters can be at most
    3 bytes long. Amazon does mention this in their docs, but it still bit
    us a few times.
  • varchar field lengths are in terms of bytes, not characters. Python byte
    strings treat the two as one and the same; unicode strings do not. It
    took us a little while to realize this was happening and to get our code
    set up to do byte-length truncation without cutting a unicode string off
    mid-character. To get an idea of how we went about doing unicode-aware
    truncation, check out this relevant Stack Overflow thread; a rough sketch
    of the approach also appears after this list.
  • Moving floats between different systems always has some issues. In
    Python, the default string output of small float values is scientific
    notation (e.g. 3.32e-09). Unfortunately, Redshift doesn’t recognize this
    format, so we needed to force our data prep job to always output floats
    in decimal format (see the sketch below).
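
These fixes boil down to a few small helpers in the data prep job. The
sketches below are illustrative rather than our exact code, and the helper
names are made up:

    def strip_wide_chars(s):
        # Redshift only accepts UTF-8 sequences up to 3 bytes, so drop any
        # code point above U+FFFF (those encode to 4 bytes).
        return "".join(ch for ch in s if ord(ch) <= 0xFFFF)

    def truncate_utf8(s, max_bytes):
        # varchar(n) counts bytes, not characters, so truncate the encoded
        # bytes and decode with errors="ignore" to drop any character that
        # got split at the boundary.
        encoded = s.encode("utf-8")
        if len(encoded) <= max_bytes:
            return s
        return encoded[:max_bytes].decode("utf-8", "ignore")

    def format_float(value):
        # Force plain decimal output; Python's default string form for small
        # floats is scientific notation, which COPY won't parse.
        return "{0:.10f}".format(value)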

Now that we have a large body of data loaded into Redshift, we’re working on
building out tooling based on NSQ to do our data prep work in a streaming
fashion, which should allow us to do smaller incremental updates more easily.

In the end, we worked through our data loading issues with Redshift but it was
one of the more acute pain points we encountered. From our conversations with
Amazon, they’re definitely aware of this and we’re interested to see what
they’ll come out with, but for now the provided tooling is pretty limited.

Availability

Long term, there are a number of periodic and online tasks that we’re thinking
about using a tool like Redshift for. Unfortunately, as things stand today we
would not be comfortable relying on Redshift as a highly available system.

Currently, if any node within a Redshift cluster fails, all outstanding
queries will fail and Amazon will automatically start replacing the failed
node. In theory this recovery should happen very quickly. The cluster
will be available for querying as soon as the replacement node is added,
but performance on the cluster will be degraded while data is restored
on to the new node.

At this point, we have no data on how well this recovery process works in
the real world. Additionally, we have concerns about how well it will hold up
when there are larger issues within an availability zone or region.
Historically, there have been a number of cases where issues within one
Amazon service (e.g. EBS) have cascaded into other services, leading to long
periods of unavailability or degradation.

Until there’s a significant track record behind a system like this, we’re
hesitant to trust anything that will “automatically recover”.

There is the option of running a mirrored Redshift cluster in a different AZ
or region, but that gets expensive fast. Additionally, we’d have to build
out even more tooling to make sure those two clusters stay in sync with each
other.

Limited Feature Set

Redshift is very impressive feature-wise for a 1.0 product. That being said,
a number of the competing products have been around for a while and offer
some major features that Redshift lacks.

The biggest missing feature for us is some kind of secondary indexing or
projections. Right now, if you sort or filter by any column other than the
SORTKEY, Redshift will do a full table scan. Technically you could create a
copy of that table with a different sort key, but then it becomes your
problem to keep those tables in sync and to query the right table at the
right time.
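
To make that concrete, the workaround looks something like the sketch below,
with a made-up schema. Every load then has to write to both tables, and every
query has to pick the copy whose sortkey matches its filter:

    import psycopg2

    # Hypothetical schema: the same click data kept twice, sorted two
    # different ways, so filters on either column can avoid a full scan.
    CLICKS_BY_TIME = """
        CREATE TABLE clicks_by_time (
            click_ts        timestamp,
            referrer_domain varchar(255),
            user_id         varchar(40)
        ) SORTKEY (click_ts)
    """

    CLICKS_BY_DOMAIN = """
        CREATE TABLE clicks_by_domain (
            click_ts        timestamp,
            referrer_domain varchar(255),
            user_id         varchar(40)
        ) SORTKEY (referrer_domain)
    """

    conn = psycopg2.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="analyst", password="...",
    )
    with conn.cursor() as cur:
        cur.execute(CLICKS_BY_TIME)
        cur.execute(CLICKS_BY_DOMAIN)
    conn.commit()
    conn.close()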

Some other “missing” features include tools for working with time series data,
geospatial query tools, advanced HA features, and more mature data loading
tools.

The Bottom Line

Redshift is great for our needs today, but we’ll see if Amazon’s development
keeps up as our needs change going forward. Given Amazon’s impressive track
record for quickly iterating on and improving products we’re hopeful, but we
do have our eyes on competing products as our use of data warehousing tools
matures.

by Sean O’Connor

Source: Bitly
