Cassandra Compaction Trade-Offs

Cassandra at OLX

At OLX chat we need to keep user chats for a long time. Like WhatsApp, we have a lot of users interacting with each other, sending millions of messages per day; but unlike WhatsApp, our users can log in from multiple frontend clients, and hence we also need to keep their chat history on our servers. Since a user tends to access their data very frequently, they should be able to do so in as little time as possible.

The data model we deal with is very easily shardable, and hence we decided to go for Cassandra. Cassandra has been serving our archival needs since 2017, after we migrated from a legacy SQL system once we realized that we were not going to use the ACID goodies that relational DBMSs are designed for. Each user only needs their own data, so the data can easily be partitioned and distributed using the user id. It would have been a match made in heaven if our data had been immutable, but since additional state often has to be associated with messages, we also need to update individual chat messages.

Using Cassandra we can write a lot of data rapidly and access it at a later point in time without much hassle. Cassandra manages this by first buffering writes in memory and later flushing them to disk as a single structured log file (called an SSTable). After a bunch of these SSTables have accumulated, it runs a background process (henceforth referred to as compaction in this post) that merges them into one or more new SSTables.

In this blog post I am going to describe the problems we faced due to sub-optimal compaction strategy choices, how we identified those problems, the steps we took to mitigate them, and the results we have seen so far.

An Introduction to Compactions

In the past decade, a lot of database systems have matured for production usage; prominent among them are MongoDB, Cassandra, Redis, Elasticsearch, etc. These can be broadly categorized into key-value stores, document stores, and wide-column stores. While the storage models differ, the central philosophy remains the same, i.e. no dependencies between two data entities at the database level.

A lot of these database engines (like Cassandra, RocksDB, and WiredTiger) are built on the LSM-tree data structure. Since all writes are first accumulated in memory and then written to disk based on size and time thresholds, writes are pretty fast. When flushing, the keys are sorted and written to a single disk file (an SSTable) for easy retrieval. After multiple SSTables have accumulated, they are merged so that data sharing the same partition key comes together; the data is aggregated and written out again into a new SSTable. This act of aggregating the data in SSTables and writing it into a new SSTable is called compaction. At the end of a compaction, all the data with the same partition key is colocated for easy search and retrieval.

A simplified compaction. Each key is kept in the table sorted, in this case lexicographically
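The same idea as a minimal sketch (an illustration only, not Cassandra's actual implementation): if we treat each SSTable as a sorted map of key to (timestamp, value), compaction boils down to merging a batch of tables and keeping only the newest write for each key. The compact() helper below is purely hypothetical.

# Toy illustration of an LSM-style compaction: several "SSTables" (here just
# sorted dicts of key -> (timestamp, value)) are merged into one, keeping only
# the most recent value for each key.

def compact(sstables):
    """Merge a list of toy SSTables into a single compacted SSTable."""
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            # Keep the write with the highest timestamp for each key.
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    # SSTables keep keys sorted (lexicographically here) for fast lookups.
    return dict(sorted(merged.items()))

sstable_1 = {"alice": (1, "hi"), "bob": (2, "hello")}
sstable_2 = {"alice": (3, "hi, edited"), "carol": (4, "hey")}

print(compact([sstable_1, sstable_2]))
# {'alice': (3, 'hi, edited'), 'bob': (2, 'hello'), 'carol': (4, 'hey')}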

As you can see, once a key has been written it does not necessarily stay in the same SSTable, and this leads to amplification. Because of the way data is written in Cassandra, any update to a key leads to the same data being rewritten multiple times. This is undesirable behavior in a DB system, even if necessary. In analytical terms, we usually refer to this as the write amplification factor.

write amplification factor (waf) = physical writes / logical writes

A similar thing happens when we try to read a key. The row elements might be scattered across various SSTables, so one key access might lead to multiple disk seeks. This is quantified by the read amplification factor.

read amplification factor (raf) = physical reads / logical reads

The desired behavior in any DB system is to keep both of these to a minimum. If we talk about a typical SQL database (at the DB level, ignoring the complexity of the physical media), read and write amplification are generally close to one. In the case of Cassandra, however, both raf and waf can vary, and they depend on the compaction strategy used.
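To make the two ratios concrete, here is a tiny worked example with made-up counter values:

# Hypothetical counters for one node (the numbers are made up).
logical_writes = 10_000_000    # writes issued by clients
physical_writes = 35_000_000   # rows actually (re)written to disk, incl. compactions
logical_reads = 5_000_000      # read requests from clients
physical_reads = 30_000_000    # SSTable reads performed to serve them

waf = physical_writes / logical_writes   # 3.5 -> each row rewritten ~3.5 times
raf = physical_reads / logical_reads     # 6.0 -> ~6 SSTables touched per read
print(f"waf = {waf:.1f}, raf = {raf:.1f}")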

Cassandra has three main compaction strategies. Each strikes a different trade-off between read latency and disk IO.

Size Tiered Compaction Strategy (STCS)

Triggered when multiple (by default 4) SSTables of similar size are present. It works great if you have immutable data (because there will be no spread of a key's data across SSTables) and an unpredictable write rate (relatively speaking, the write amplification is directly proportional to the data written). If you have mutable data, it leads to considerable read amplification, as the spread of the data can be large; in the worst case, a read might touch every SSTable. However, the write amplification is predictable with respect to the amount of data written. Low waf, high raf.

Time Window Compaction Strategy (TWCS)

Compacts all the SSTables within a time window (say 15 days) into one big SSTable. Fresh SSTables (younger than the window) are compacted using STCS. This is great if you have predictable data demands and your data is not mutated after a given time period. The pros and cons are mostly the same as STCS. Low waf, high raf.

Leveled Compaction Strategy (LCS)

Compacts SSTables into levels, where each level is 10 times larger than the previous one. Once a level exceeds its target size, its SSTables are compacted with the SSTables of the next level. The way LCS is designed ensures that the spread of a key's data is bounded by the number of levels, so the raf stays consistent. Low raf, high waf.
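To build some intuition for the level sizes, here is a back-of-the-envelope sketch, assuming the default fanout of 10 and a target of roughly 160 MB per SSTable (real LCS sizes levels by total bytes, so this is only an approximation):

# Rough view of LCS level capacities: level N holds on the order of 10**N SSTables.
FANOUT = 10       # each level is ~10x the size of the previous one
SSTABLE_MB = 160  # default target SSTable size for LCS

for level in range(1, 5):
    capacity = FANOUT ** level  # approximate max SSTables in this level
    print(f"L{level}: up to ~{capacity} SSTables (~{capacity * SSTABLE_MB / 1024:.0f} GB)")

This is also why the per-level counts reported by nodetool later in this post look like 19/10 and 209/100: the number after the slash is (roughly) that level's target SSTable count, shown while a level is temporarily above it.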

Problem and Diagnosis

When we started with Cassandra we naively put in TWCS as the compaction strategy. In due time this became problematic for us because of the huge raf, which manifested as random alerts, sustained CPU spikes, and memory pressure. Fortunately, Cassandra ships with a diagnostic tool (nodetool) which helped us in diagnosing the issue. I will take the example of two tables: chats, where we store the chat messages of a user, and threads, where we store the state of a conversation. We want to analyze

  1. the spread of data across SSTables (characterizing raf)
  2. the read latency percentiles
  3. the number of SSTables for each table.
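All three can be read off nodetool tablehistograms and nodetool tablestats. As a convenience, the small sketch below pulls the per-read SSTable count and read latency percentiles out programmatically; it assumes nodetool is on the PATH and the usual tablehistograms column layout (Percentile, SSTables, Write Latency, Read Latency, Partition Size, Cell Count), which can vary slightly between Cassandra versions.

# Extract the 50th/99th percentile "SSTables per read" and read latency from
# `nodetool tablehistograms <keyspace>.<table>` (layout assumptions noted above).
import subprocess

def table_histogram(keyspace: str, table: str) -> dict:
    out = subprocess.run(
        ["nodetool", "tablehistograms", f"{keyspace}.{table}"],
        capture_output=True, text=True, check=True,
    ).stdout
    stats = {}
    for line in out.splitlines():
        parts = line.split()
        if parts and parts[0] in ("50%", "99%"):
            stats[parts[0]] = {
                "sstables_per_read": float(parts[1]),  # ~ raf at this percentile
                "read_latency_us": float(parts[3]),
            }
    return stats

for table in ("chats", "threads"):
    print(table, table_histogram("india", table))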

Following is the output after running these commands for both the chats and threads tables.

Chats Table Measurement

Output of nodetool tablehistograms india.chats (india/chats histograms)

The associated table info, from nodetool tablestats india.chats:

Keyspace : india
Read Count: 8454739078
Read Latency: 2.5936929556960835 ms
Write Count: 8263045277
Write Latency: 0.0439760641604433 ms
Pending Flushes: 0
Table: chats
SSTable count: 24
                ----------------------------------

For the chats table our measures are:

  1. raf(50%) = 12
  2. latency (99%) = 12 ms
  3. count of SSTables = 24

Threads Table Measurement

Output of nodetool tablehistograms india.threads

The associated table info for threads, from nodetool tablestats india.threads:

Keyspace : india
Read Count: 8454742954
Read Latency: 2.593692375761729 ms
Write Count: 8263051337
Write Latency: 0.04397605945722324 ms
Pending Flushes: 0
Table: threads
SSTable count: 25
        ----------------------------------

For the threads table our measures are:

  1. raf(50%) = 6
  2. count of SSTables = 25
  3. latency (99%) = 4 ms

There is a lot to cover in these metrics, but what interests us is that the number of SSTables touched to service 50% of the requests was as much as half of the total SSTable count. This may differ from case to case, but in our case a huge data spread was taking place. Since the strategy we used was STCS, we knew that for a lower raf we had to change the compaction strategy. This decision was also partially based on the following facts:

  1. Our read-to-write ratio ranged from 2:1 for some tables to as high as 20:1 for some heavily used client-facing tables. Since we were servicing that many reads, it was more prudent and scalable to optimize for reads rather than writes.
  2. We had spare storage and IO, which allowed us to dedicate the excess IO to compaction.

These observations led us to look into changing the compaction strategy. We concluded that, since we could absorb the extra write amplification as background IO, we should switch these tables to LCS.
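The switch itself is a small schema change per table. Below is a minimal sketch of issuing it through the DataStax Python driver; the contact points are placeholders and the compaction sub-options are left at their defaults, so treat it as an illustration rather than the exact commands we ran.

# Switch the chats and threads tables to Leveled Compaction Strategy.
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1", "10.0.0.2"]).connect("india")  # placeholder hosts

for table in ("chats", "threads"):
    session.execute(
        f"ALTER TABLE {table} WITH compaction = "
        "{'class': 'LeveledCompactionStrategy'}"
    )

# Cassandra then re-compacts the existing SSTables into levels in the
# background; `nodetool compactionstats` shows the pending work while it
# catches up.

Such a change is worth rolling out table by table and during off-peak hours, since the initial re-leveling generates a burst of compaction IO.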

Improvements

As discussed above, LCS offers better read latency at the cost of extra write amplification, and it also consumes less space during compaction than STCS. The results after changing the compaction strategy are as follows.

Chats Table Measurement

Output of nodetool tablehistograms india.chats (india/chats histograms) and nodetool tablestats india.chats:
Keyspace : india
Read Count: 823153195
Read Latency: 0.6837681826965393 ms
Write Count: 1064722103
Write Latency: 0.0379458767364389 ms
Pending Flushes: 0
Table: chats
SSTable count: 1198
SSTables in each level: [1, 19/10, 209/100, 969, 0, 0, 0, 0, 0]
----------------------------------

After changing the strategy, our measures for the chats table are:

  1. raf(50%) = 3
  2. count of SSTables = 1198
  3. latency (99%) = 4 ms

Threads Table Measurement

Output of nodetool tablehistograms india.threads (india/threads histograms) and nodetool tablestats india.threads:
Total number of tables: 110
----------------
Keyspace : india
Read Count: 883213588
Read Latency: 0.6891794469810625 ms
Write Count: 1139734122
Write Latency: 0.038055260776863886 ms
Pending Flushes: 0
Table: threads
SSTable count: 744
SSTables in each level: [0, 20/10, 206/100, 518, 0, 0, 0, 0, 0]

----------------------------------

And our measures for the threads table become:

  1. raf(50%) = 3
  2. count of SSTables = 744
  3. latency (99%) = 4 ms

Some inferences that can be drawn from the results shared above:

  1. The number of SSTables touched for a single read has come down a lot in absolute terms. This is a much-desired improvement because most reads can now be serviced from just a few SSTables, saving precious read IO.

  2. The read latency has also come down by a large factor for both the threads and chats tables, across all percentiles. This is a direct reflection of the server doing less IO work and needing to consult fewer SSTables.

  3. The total number of SSTables has increased by a lot, but this is due to the very nature of LCS. This is tolerable for us because we have spare IO bandwidth with which these tables can be compacted in the background.

Conclusion

Cassandra offers a lot of features. However, many caveats should be kept in mind while using it, particularly relating to IO usage, compactions, and the tuning of internal parameters. This post shows some of the steps we took to improve the performance of our cluster. Some of the lessons we learned gradually are:

  1. Choose a compaction strategy according to the use case. As can be seen, we reduced our data spread considerably, which reduced our read latencies.
  2. Additionally, choosing the correct cloud instance type for an IO-bound application is very important, as you will need plenty of disk bandwidth to accommodate your compaction needs (we used i3-type instances on AWS).


