Topic: DDIA Chapter 3 covers why STCS favors write throughput while LCS bounds read amplification, and when each is appropriate (Cassandra uses STCS by default, RocksDB uses leveled)

Date: 2026-05-29

Time: 13:24

STCS vs. LCS: Compaction Strategy Tradeoffs in This Codebase

Two Implementations, Two Philosophies

This repo has two LSM-tree modules that illustrate the compaction spectrum DDIA Chapter 3 describes:

1. log-structured-merge-tree/lsm.py — A basic LSM tree with a single, threshold-based compaction (effectively STCS-like behavior)

2. sstable-and-compaction/sstable.py — A full implementation with both size-tiered and leveled strategies, selectable via CompactionManager

Size-Tiered Compaction (STCS)

In sstable-and-compaction/test_sstable.py:47, you can see STCS in action:


manager = CompactionManager(tmpdir, strategy='size_tiered', min_threshold=2)

The simpler LSM tree in lsm.py demonstrates the same idea at line 316:


if len(self._sstables) >= self._compaction_threshold:

When the count of SSTables hits a threshold, they all get merged together. This is the essence of size-tiered: SSTables accumulate at the same "tier" (roughly similar size), and when enough pile up, they merge into a bigger one.

Why STCS favors write throughput: Each write only goes to the memtable (lsm.py:232, self.memtable[key] = value), then gets flushed to a new SSTable when the memtable fills (lsm.py:244, if len(self.memtable) >= self._threshold). Compaction happens lazily — you can let many SSTables accumulate before merging. This means writes never stall waiting for compaction to rearrange levels. The write path is: memtable → flush → done. Compaction is deferred background work.

The read cost: Notice testlsm.py:59-65 — with compactionthreshold=100, the test creates multiple SSTables with the same key "k" and must search newest-to-oldest to find the right value. Without compaction, a read might check the memtable, then every SSTable in reverse order. That's read amplification — potentially O(N) SSTables scanned per lookup.

Leveled Compaction (LCS)

In sstable-and-compaction/test_sstable.py:108:


mgr = CompactionManager(lcs_dir, strategy='leveled', l0_compaction_trigger=2)

The leveled strategy organizes SSTables into levels with size constraints. From the grep results, we can see the key parameters in sstable.py:

The compaction trigger logic (lines 383-388) checks two conditions:

1. Level 0 has too many SSTables (≥ l0_trigger)

2. Any level exceeds its size budget

After compaction, SSTables are promoted: test_sstable.py:119 asserts result[0].level == 1.

Why LCS bounds read amplification: In a leveled scheme, each level (except L0) has non-overlapping key ranges. A point read needs to check at most one SSTable per level. With 7 levels, that's at most ~7 SSTables checked — a bounded O(log N) regardless of how much data you have.

The write cost: Leveled compaction is more aggressive — when a level overflows, its SSTables get merge-sorted into the next level, potentially rewriting SSTables that already exist there. A single write can trigger a cascade of rewrites across levels. This is write amplification — the same data gets rewritten many times as it moves down levels.

The Tradeoff Matrix

| | STCS | LCS |

|---|---|---|

| Write amp | Low (merge similar-size groups lazily) | High (cascade rewrites across levels) |

| Read amp | High (scan many unsorted runs) | Low (≤1 SSTable per level) |

| Space amp | High (duplicate keys across tiers) | Low (compaction deduplicates aggressively) |

| Best for | Write-heavy (logging, time-series) | Read-heavy (user-facing queries) |

This is why Cassandra defaults to STCS (optimized for its common write-heavy workloads) while RocksDB defaults to leveled (optimized for serving reads in embedded database scenarios like MyRocks or CockroachDB's storage layer).

What's Missing

The observations don't include the full CompactionManager implementation (lines 250–438 of sstable.py were truncated). The actual merge logic for leveled compaction — specifically how it selects which L(N) SSTable overlaps with L(N+1) and does a targeted merge — would show the key-range non-overlap invariant that makes LCS reads efficient. The getlevels method (line 371) and levelmax_size (line 377) are visible, but the actual merge-selection and output-splitting logic is not.