Topic: The paper describes time-window-based merge scheduling to avoid compaction during peak load — neither implementation models this, but it's critical for production use

Date: 2026-05-29

Time: 12:32

Time-Window-Based Merge Scheduling: A Production Gap

What the Implementations Actually Do

Both LSM tree implementations in this repo trigger compaction synchronously and unconditionally — the exact opposite of time-window-aware scheduling.

log-structured-merge-tree/lsm.py

The LSMTree constructor at line 202-208 accepts a compaction_threshold (default 4). Compaction fires inline during writes:


# line 315-317
if len(self._sstables) >= self._compaction_threshold:
    self.compact()

This means a put() or delete() call that happens to tip the SSTable count over the threshold will block the caller while compact() runs a full k-way merge (line 319, line 323). There's no notion of "now is a bad time" — the write that triggers threshold just pays the cost.

sstable-and-compaction/sstable.py

The CompactionManager (visible in tests at testsstable.py:51-56) uses the same pattern: needscompaction() checks a count threshold (minthreshold=2 for size-tiered, l0compactiontrigger=2 for leveled), and runcompaction() executes immediately when called. The strategy selection (size-tiered vs. leveled) controls *which* SSTables merge, but not *when* the merge happens relative to system load.

What DDIA Describes and Why It Matters

Kleppmann's discussion of LSM compaction highlights a fundamental tension: compaction competes with foreground reads and writes for disk I/O bandwidth. In production systems like Cassandra and RocksDB, this manifests as:

1. Write stalls — if compaction can't keep up, the number of L0 SSTables grows, reads slow down (more files to check), and eventually writes must be throttled or blocked.

2. Latency spikes — a large compaction running during peak traffic causes p99 latency to spike because the disk is saturated with merge I/O.

Time-window-based scheduling addresses this by deferring non-urgent compaction to off-peak hours (e.g., overnight), rate-limiting compaction I/O during peak periods, or using separate I/O priorities. Cassandra's compactionthroughputmbpersec and RocksDB's rate_limiter are real-world examples.

What's Missing From Both Implementations

Neither implementation has any of the following, all of which would be needed for production-viable compaction scheduling:

Why This Gap Is Critical

For a teaching implementation, immediate compaction is fine — it demonstrates the merge logic clearly. But any team porting this to production must understand that the *scheduling policy* around compaction is as important as the merge algorithm itself. Without it:

This is a case where the algorithm is correct but the operational envelope is absent.

Topics to Explore

Beliefs