File: sstable-and-compaction/sstable.py

Date: 2026-05-28

Time: 18:23

sstable-and-compaction/sstable.py

Purpose

This file is a self-contained implementation of Sorted String Tables (SSTables) — the on-disk data structure at the heart of LSM-tree storage engines like LevelDB, RocksDB, and Cassandra. It owns three responsibilities:

1. Binary file format — writing and reading a custom SSTable format with header, sorted entries, sparse index, and footer.

2. K-way merge — combining multiple SSTables into one while deduplicating by key and resolving conflicts by timestamp.

3. Compaction strategies — both size-tiered (STCS) and leveled (LCS) compaction, implementing the two major strategies discussed in DDIA Chapter 3.

This corresponds directly to the "SSTables and LSM-Trees" section of DDIA, providing a concrete implementation of concepts that are typically described at whiteboard level.

Key Components

Constants and Wire Format


MAGIC = b"SSTB"          # 4-byte file identifier
HEADER_FMT = ">4sHI"     # magic(4) + version(2) + entry_count(4) = 10 bytes
FOOTER_FMT = ">QI"        # index_offset(8) + index_count(4) = 12 bytes
TOMBSTONE_MARKER = 0xFF   # single-byte sentinel for deleted keys

The file layout is: [header][entries...][sparse index][footer]. The footer is at the end so the reader can seek to EOF minus FOOTER_SIZE to bootstrap — a pattern shared with real SSTable formats.

SSTableEntry / SSTableMetadata

Simple data carriers. SSTableEntry.value = None encodes a tombstone (delete marker). SSTableMetadata captures key range, entry count, level, and timestamp — everything needed for compaction decisions without opening the file.

SSTableWriter

Contract: Caller must add keys in sorted order. The writer builds a sparse index by recording every block_size-th entry's key and byte offset.

The entry encoding is: [keylen:2][keybytes][timestamp:8][tombstonebyte | valuelen:4 + valuebytes]. Note the asymmetry: tombstones are 1 byte, values are 4+N bytes. This means the read_entry parser reads 1 byte and branches on whether it equals 0xFF.

SSTableReader

Reads an SSTable file, loading the sparse index into memory on construction.

On construction, the reader also determines minkey, maxkey, and _timestamp by reading the first entry and scanning from the last index entry to find the final entry. These are cached for metadata queries.

mergesstables(readers, outputpath, remove_tombstones=False)

A classic k-way merge using a min-heap. The heap entries are (key, -timestamp, sourceindex, entry, iterator). The negative timestamp means that when two entries share a key, the newest one is popped first. All subsequent entries with the same key are skipped via the lastkey guard — this is how last-write-wins conflict resolution works.

remove_tombstones=True drops delete markers, which is appropriate only at the bottom level of compaction where there's no older data that needs to be shadowed.

CompactionManager

Manages a collection of SSTables and implements two compaction strategies:

Size-Tiered (STCS): Groups SSTables into buckets by file size tier. When any tier accumulates min_threshold (default 4) SSTables, they're merged into one. Simple and write-optimized — good for write-heavy workloads but can cause space amplification.

Leveled (LCS): Organizes SSTables into levels with exponentially increasing size caps (levelbasesize * fanout^(level-1)). L0 compacts into L1 when it has l0_trigger SSTables. Higher levels compact by picking the SSTable with the most overlap with the next level and merging it down. This bounds space amplification and read amplification at the cost of more write amplification.

Patterns

Dependencies

Imports: All stdlib — heapq for k-way merge, struct for binary encoding, os for file size, time for path generation, dataclasses and typing for structure.

Imported by: testsstable.py and testertest_sstable.py — only test harnesses. This module is a leaf in the dependency graph with no downstream production consumers.

Flow

Write path:

1. Caller creates SSTableWriter, calls add() for each key in sorted order.

2. Every block_size entries, the sparse index records the key and byte offset.

3. finish() appends the index block, footer, patches the header, and closes the file.

Read path (point lookup):

1. SSTableReader._init_ reads header, footer, and sparse index into memory.

2. get(key) binary-searches the index to find the block, then linearly scans within the block.

3. Returns the entry if found, None otherwise. If the key falls between index entries, scanning stops at the next index entry's offset.

Compaction path:

1. CompactionManager.needs_compaction() checks trigger conditions.

2. runcompaction() selects which SSTables to merge, calls mergesstables(), updates the internal list, and returns the new SSTables.

3. For LCS, merged output gets level = target_level assigned.

Invariants

Error Handling

Minimal. The reader asserts magic == MAGIC on open — the only validation. There is:

This is appropriate for a teaching implementation — it shows the algorithms clearly without the defensive engineering that production systems need.

Topics to Explore

Beliefs