File: log-structured-merge-tree/lsm.py

Date: 2026-05-28

Time: 18:09

log-structured-merge-tree/lsm.py

Purpose

This file implements a Log-Structured Merge Tree (LSM Tree) storage engine — the write-optimized data structure described in Chapter 3 of *Designing Data-Intensive Applications*. It owns the full read/write path: ingesting key-value pairs through an in-memory buffer (memtable), persisting them to sorted on-disk files (SSTables), and merging those files via compaction. This is the same fundamental architecture behind LevelDB, RocksDB, and Cassandra's storage layer.

Key Components

TOMBSTONE = b""

A sentinel empty-bytes value representing a deleted key. Deletions are writes — the tombstone propagates through memtable flushes and is only garbage-collected during compaction.

WAL (Write-Ahead Log)

Durability layer. Every mutation hits the WAL before the memtable, so an in-memory crash loses nothing.

SSTable (Sorted String Table)

Immutable on-disk file containing key-value pairs in sorted order, with a sparse index embedded in a footer for efficient lookups.

File layout:


[data entries: (klen, key, vlen, value)*]
[footer: (klen, key, offset)* for every Nth entry]
[footer_start: 8 bytes]
[index_count: 4 bytes]

LSMTree

The top-level storage engine coordinating all components.

Patterns

Write-optimized with read amplification tradeoff. Writes are always sequential (append to WAL, insert into sorted memtable). Reads may touch memtable + N SSTables in the worst case — classic LSM tradeoff.

Tombstone-based deletion. Deletes never remove data in place. The tombstone must survive through all levels until compaction merges it out. This is why get() checks v == TOMBSTONE and returns None rather than the bytes.

Sparse indexing. Rather than indexing every key (which would be large), the SSTable samples every 16th key. Lookups binary-search the sparse index then linear-scan a small block. This is the same approach described in DDIA's SSTable section.

Sequence numbers for recency. Each SSTable gets a monotonically increasing seq. During both point reads (reversed(self._sstables)) and compaction (sort by -seq), newer data always wins.

Size-tiered compaction (simplified). All SSTables are in a single "level" and compaction merges everything into one file. This is closer to size-tiered than leveled compaction — simpler but with higher space amplification.

Dependencies

Imports:

Imported by:

Flow

Write Path


put(key, value)
  → WAL.append(key, encoded_value)
  → memtable[key] = encoded_value
  → if len(memtable) >= threshold:
      _flush()
        → freeze memtable, replace with empty SortedDict
        → SSTable.write(frozen entries)
        → WAL.truncate()
        → if len(sstables) >= compaction_threshold:
            compact()

Read Path


get(key)
  → check memtable           (newest)
  → check immutable memtables (newest first)
  → check SSTables            (newest first, via sparse index + scan)
  → return first match, or None

Recovery Path


LSMTree.__init__()
  → _load_existing_sstables()   (scan dir for .sst files, load their indexes)
  → WAL(path)
  → _replay_wal()               (re-populate memtable from WAL entries)

Invariants

1. WAL-before-memtable. Every mutation writes to the WAL before the memtable, ensuring crash recovery can reconstruct the memtable state.

2. SSTables are sorted by key internally and tracked newest-last in self._sstables (sorted by seq).

3. Newest data wins. In get(), memtable is checked first; SSTables are traversed in reverse-seq order. In range_scan(), higher-priority sources overwrite lower. In compact(), entries are sorted by (key, -seq) and only the first occurrence (newest) is kept.

4. Tombstones are only removed during compaction. A tombstone in the memtable or an SSTable will shadow older values of the same key until compact() runs.

5. WAL is truncated only after a successful SSTable write. If the process crashes between WAL.append and flush, replay recovers the data. If it crashes after SSTable.write but before WAL.truncate, the WAL entries are harmlessly replayed into the memtable (idempotent because it's a dict).

6. Sparse index interval is fixed per SSTable at construction time.

Error Handling

Error handling is minimal, consistent with a teaching implementation:

Topics to Explore

Beliefs