The Big Elephant
The Big Mammoth

Friday, February 6, 2015

Compaction in Cassandra

Overview

A write in Cassandra is first recorded in the commit log and in a memtable; once the memtable is full, it is flushed to disk as an SSTable. Each memtable flush creates a new SSTable. Compaction is Cassandra's process of merging these SSTables to improve the performance of partition scans and to reclaim space from deleted data (tombstones).
Full compaction: A full (major) compaction applies only to SizeTieredCompactionStrategy. It merges all SSTables into one large SSTable, and can be triggered manually with nodetool compact.

What happens during Compaction?

Compaction is a periodic background process. During compaction Cassandra merges SSTables by combining row fragments, evicting expired tombstones, and rebuilding indexes. As SSTables are sorted, the merge is efficient (no random disk I/O). After a newly merged SSTable is complete, the input SSTables are reference counted and removed as soon as possible. During compaction, there is a temporary spike in disk space usage and disk I/O.
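
To see why sorted inputs make this cheap, here is a conceptual sketch in plain Python (not Cassandra's actual code): compaction is essentially a sequential k-way merge that keeps the newest version of each key and evicts tombstones older than the GC grace period. The tuple layout and the gc_before parameter are illustrative assumptions.

    # Conceptual sketch of compaction as a k-way merge of sorted runs.
    # Not Cassandra's code; the tuple layout and gc_before are illustrative.
    import heapq
    from itertools import groupby
    from operator import itemgetter

    TOMBSTONE = object()  # marker standing in for a deletion

    def compact(sstables, gc_before):
        """Merge sorted (key, write_time, value) runs into one sorted run."""
        merged = heapq.merge(*sstables, key=itemgetter(0))  # sequential, no random seeks
        for key, versions in groupby(merged, key=itemgetter(0)):
            _, write_time, value = max(versions, key=itemgetter(1))  # newest wins
            if value is TOMBSTONE and write_time < gc_before:
                continue  # expired tombstone: evict instead of rewriting it
            yield key, write_time, value

    old = [("a", 1, "x"), ("c", 2, TOMBSTONE)]   # an older, sorted SSTable
    new = [("a", 5, "y"), ("b", 4, "z")]         # a newer one
    print(list(compact([old, new], gc_before=3)))
    # [('a', 5, 'y'), ('b', 4, 'z')]: 'c' and its tombstone are gone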

Impact of compaction

Compaction impacts both read and write performance. During compaction, higher disk I/O and utilization can cause temporary performance degradation.
Impact on reads: Compaction can slow reads that are not served from the cache, due to the higher disk I/O. After a compaction completes, however, off-cache read performance improves, since there are fewer SSTable files on disk to check.
Impact on disk: A full compaction is I/O and CPU intensive and can temporarily require double the disk space when no old versions or tombstones are evicted. It is thus advisable to provision disks at twice the size of your data.
Compaction also reclaims disk space when a large portion of the data in the SSTables consists of tombstones.

Types of Compaction Strategies
Cassandra provides two compaction strategies:
SizeTieredCompactionStrategy: This is the default compaction strategy. It gathers SSTables of similar size (four by default) and compacts them together into a larger SSTable.
 
In figure 1, each green box represents an SSTable, and the arrow represents compaction. As new SSTables are created, nothing happens at first. Once there are four, they are compacted together, and again when another four accumulate.

Figure 2 shows what we expect much later, when second-tier SSTables have been combined to create a third tier, the third tier combined to create a fourth, and so on.
LeveledCompactionStrategy: Introduced in Cassandra 1.0, this strategy creates SSTables of a fixed, relatively small size (5 MB by default) that are grouped into levels. Within each level, SSTables are guaranteed to be non-overlapping. Each level (L0, L1, L2 and so on) is 10 times as large as the previous.

In figure 3, new sstables are added to the first level, L0, and immediately compacted with the sstables in L1 (blue). When L1 fills up, extra sstables are promoted to L2 (violet). Subsequent sstables generated in L1 will be compacted with the sstables in L2 with which they overlap. As more data is added, leveled compaction results in a situation like the one shown in figure 4.
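Concretely, the strategy is chosen per table through the CQL compaction options. Below is a minimal sketch using the DataStax Python driver; the keyspace and table names (my_ks, events) are made up for illustration, while min_threshold and sstable_size_in_mb are the standard options for the two strategies.

    # Minimal sketch: switching a table's compaction strategy via CQL.
    # Assumes a local node; my_ks/events are hypothetical names.
    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('my_ks')

    # Size-tiered (the default): compact groups of 4 similarly sized SSTables.
    session.execute("""
        ALTER TABLE events WITH compaction =
            {'class': 'SizeTieredCompactionStrategy', 'min_threshold': 4}
    """)

    # Leveled: small fixed-size SSTables grouped into non-overlapping levels.
    session.execute("""
        ALTER TABLE events WITH compaction =
            {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 5}
    """)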



What problems does leveled compaction solve?
There are three problems with size-tiered compaction in update-heavy workloads:
  1. Read performance can be inconsistent because there are no guarantees as to how many sstables a row may be spread across: in the worst case, we could have columns from a given row in each sstable.
  2. A substantial amount of space can be wasted, since there is no guarantee as to how quickly obsolete columns (tombstones) will be merged out of existence; this is particularly noticeable when there is a high ratio of deletes.
  3. Space can also be a problem as sstables grow larger from repeated compactions, since an obsolete sstable cannot be removed until the merged sstable is completely written. In the worst case, with a single set of large sstables and no obsolete rows to remove, Cassandra would need 100% as much free space as is used by the sstables being compacted, into which to write the merged one.
Leveled compaction solves the above problems with tiered compaction as follows:
  1. Leveled compaction guarantees that 90% of all reads will be satisfied from a single sstable (assuming nearly-uniform row size). The worst case is bounded by the total number of levels: e.g., 7 levels for 10 TB of data (see the check below).
  2. At most 10% of space will be wasted by obsolete rows.
  3. Only enough space for 10x the sstable size needs to be reserved for temporary use by compaction.
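
The 7-levels-for-10TB bound follows directly from the tenfold level sizing; here is a quick back-of-the-envelope check, assuming the 1.0-era 5 MB SSTable size and roughly ten SSTables in L1:

    # Back-of-the-envelope check of the "7 levels for 10 TB" bound.
    sstable_mb = 5                 # assumed per-SSTable size (1.0-era default)
    level_mb = 10 * sstable_mb     # L1 holds ~10 SSTables
    total_mb = 10 * 1024 * 1024    # 10 TB of data, in MB

    levels = 1
    while level_mb < total_mb:     # each level is 10x the previous
        level_mb *= 10
        levels += 1
    print(levels)                  # -> 7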

When to use which compaction
Both compaction strategies are good choices, depending on the use case; it is thus important to understand when to use which.
When to use Leveled Compaction
The following are the scenarios in which leveled compaction is recommended:

1. High Sensitivity to Read Latency: In addition to lowering read latencies in general, leveled compaction lowers the amount of variability in read latencies. If your application has strict latency requirements for the 99th percentile, leveled compaction may be the only reliable way to meet those requirements, because it gives you a known upper bound on the number of SSTables that must be read.

2. High Read/Write Ratio: If you perform at least twice as many reads as you do writes, leveled compaction may actually save you disk I/O, despite consuming more I/O for compaction. This is especially true if your reads are fairly random and don’t focus on a single, hot dataset.

3. Rows Are Frequently Updated: If you have wide rows to which new columns are constantly added, each row will be spread across multiple SSTables with size-tiered compaction. Leveled compaction, on the other hand, keeps the number of SSTables that a row is spread across very low, even with frequent row updates. So, if you’re using wide rows, leveled compaction can be useful.

4. Deletions or TTLed Columns in Wide Rows: If you maintain event timelines in wide rows and set TTLs on the columns in order to limit the timeline to a window of recent time, those columns will be replaced by tombstones when they expire. Similarly, if you use a wide row as a work queue, deleting columns after they have been processed, you will wind up with a lot of tombstones in each row. So, even if there aren’t many live columns still in a row, there may be tombstones for that row spread across extra SSTables; these too need to be read by Cassandra, potentially requiring additional disk seeks. In this scenario leveled compaction is a better choice, as it limits how far a row spreads and frequently removes tombstones while merging (see the sketch after this list).
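For concreteness, a TTLed timeline write might look like the sketch below; the timeline table and the one-day TTL are hypothetical.

    # Hypothetical wide-row timeline write; expired columns become tombstones.
    from datetime import datetime, timezone
    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('my_ks')
    session.execute(
        "INSERT INTO timeline (user_id, ts, event) "
        "VALUES (%s, %s, %s) USING TTL 86400",   # expire after one day
        ('user1', datetime.now(timezone.utc), 'logged in'),
    )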

When to use Size Tiered Compaction
The following are the scenarios in which size-tiered compaction is recommended:

1. Write-Heavy Workloads: If you have more writes than reads, size-tiered compaction is recommended: it may be difficult for leveled compaction to keep up with write-heavy workloads, and because reads are infrequent, there is little benefit from the extra compaction I/O.

2. Rows Are Write-Once: If your rows are always written entirely at once and are never updated, they will naturally always be contained by a single SSTable when using size-tiered compaction. Thus, there’s really nothing to gain from leveled compaction.

3. Low I/O Performance: If your cluster is already pressed for I/O, switching to leveled compaction will almost certainly worsen the problem; it is better to stay with size-tiered compaction when disk IOPS are limited.


Verification Test
The following test can be performed to verify the impact of the two compaction strategies on read performance and disk usage (a sketch follows the steps):
1. Create a column family in Cassandra for storing users’ status updates, with user-id as the row key, the current timestamp as the column name, and the status text as the column value.
2. Set a TTL on the status column so that statuses older than 5 minutes are removed.
3. Set size-tiered compaction (the default) as the compaction strategy.
4. Keep inserting lots of messages for the same user for about an hour.
5. Meanwhile, from another program, read the status updates for the same user.
6. Watch the read latency and disk usage over time.
7. Perform the same test with the leveled compaction strategy and note the difference.
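
A minimal sketch of this test using the DataStax Python driver follows. The keyspace and table names and the durations are assumptions; step numbers from the list above are noted in the comments.

    # Sketch of the verification test; names and durations are illustrative.
    import time
    from datetime import datetime, timezone
    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS compaction_test WITH replication =
            {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    # Steps 1 and 3: status updates keyed by user, size-tiered compaction.
    session.execute("""
        CREATE TABLE IF NOT EXISTS compaction_test.status_updates (
            user_id text, ts timestamp, status text,
            PRIMARY KEY (user_id, ts)
        ) WITH compaction = {'class': 'SizeTieredCompactionStrategy'}
    """)

    # Step 2: a 5-minute TTL on every status column.
    insert = session.prepare(
        "INSERT INTO compaction_test.status_updates (user_id, ts, status) "
        "VALUES (?, ?, ?) USING TTL 300"
    )
    read = session.prepare(
        "SELECT ts, status FROM compaction_test.status_updates WHERE user_id = ?"
    )

    # Steps 4-6: write continuously for an hour, sampling read latency
    # (inline here for brevity; run the reads from a separate process per step 5).
    end = time.time() + 3600
    while time.time() < end:
        session.execute(insert, ('user1', datetime.now(timezone.utc), 'hello'))
        t0 = time.perf_counter()
        rows = list(session.execute(read, ('user1',)))
        print(f"{len(rows)} live columns, read took {time.perf_counter() - t0:.4f}s")

    # Step 7: repeat after switching the table to leveled compaction:
    #   ALTER TABLE compaction_test.status_updates
    #       WITH compaction = {'class': 'LeveledCompactionStrategy'};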
