Topic: How LevelDB's MANIFEST and RocksDB's version sets solve exactly this problem with atomic state transitions

Date: 2026-05-28

Time: 19:04

How LevelDB's MANIFEST and RocksDB's Version Sets Solve Atomic State Transitions

The Problem This Codebase Exposes

The grep for manifest|MANIFEST|versionset|VersionSet|versionedit|VersionEdit returned zero matches. This is the most important observation — the codebase has a working LSM tree with compaction, but it's missing the mechanism that makes compaction crash-safe.

Look at the compaction flow in log-structured-merge-tree/lsm.py. The compact() method at line 319 does a k-way merge of all SSTables into a single new SSTable (line 347), then presumably swaps self._sstables in memory and deletes the old files. This involves multiple steps that are not atomic:

1. Write the new merged SSTable to disk

2. Update the in-memory list (self._sstables) to point to the new file

3. Delete the old SSTable files

If the process crashes between step 1 and step 3, you have orphaned files on disk with no record of which ones are "current." If it crashes between step 2 and step 3, the in-memory state is lost (it was never persisted), and on recovery you'd see both old and new files with no way to distinguish them.

The same problem appears during flush. After _flush() writes a new SSTable (around line 315), the system must atomically transition from "these N SSTables are current" to "these N+1 SSTables are current." The WAL (line 17–66) protects individual key-value writes, but nothing protects the SSTable-level metadata.

What a MANIFEST Solves

LevelDB's MANIFEST is a write-ahead log for metadata — not for user data, but for *which files constitute the database*. Each state transition is recorded as a VersionEdit: a compact record that says "add file X at level L, remove files Y and Z." The MANIFEST file is append-only, and each VersionEdit is written atomically (with checksums, just like the WAL at write-ahead-log/wal.py lines 30–36).

The key insight: the set of live SSTable files is itself mutable state that needs crash protection, exactly like user data needs the WAL.

Here's how it maps to the missing piece in this codebase:

| This codebase does | MANIFEST would add |

|---|---|

| WAL protects memtable writes (lsm.py:17–66) | MANIFEST protects SSTable file set |

| self._sstables list tracks live files in memory | VersionEdit log tracks live files on disk |

| compact() writes new file + deletes old ones non-atomically | VersionEdit atomically records "add new, remove old" |

| Crash recovery replays WAL only (lsm.py — WAL replay) | Recovery replays MANIFEST to reconstruct file set |

How Version Sets Make This Atomic

RocksDB extends this with VersionSet — an in-memory structure where each Version is an immutable snapshot of "which SSTables exist at which levels." A compaction doesn't mutate the current version; it creates a *new* Version by applying a VersionEdit. The transition looks like:


Version_n  +  VersionEdit  →  Version_{n+1}
(old files)   (add/remove)    (new files)

This is an immutable data structure with pointer swap — the same pattern you'd use for lock-free concurrent reads. Ongoing reads can continue using Versionn while compaction installs Version{n+1}. The old version is reference-counted and freed when the last reader finishes.

Compare this to the codebase's compact() at line 319: it mutates self._sstables in place. Any concurrent reader iterating over that list during compaction would see inconsistent state. A VersionSet eliminates this entirely — readers hold a reference to an immutable version, and the swap to a new version is a single pointer update.

The CURRENT File: Bootstrapping Recovery

One subtle detail: the MANIFEST itself can be rotated (it grows over time). LevelDB uses a tiny file called CURRENT that contains just the filename of the active MANIFEST. On startup:

1. Read CURRENT to find the MANIFEST filename

2. Replay the MANIFEST from the beginning to reconstruct the VersionSet

3. Now you know exactly which SSTable files are live

This codebase has no equivalent. The LSMTree._init_ (around line 204) would need to scan the directory and guess which .sst files are valid — a fundamentally unsafe operation after a crash mid-compaction.

What's Missing From the Observations

The observations confirm the absence but can't show the solution in action. To fully understand MANIFEST/VersionSet, you'd need:

The sstable-and-compaction/sstable.py file has SSTableMetadata with a level field (line 39), suggesting level-aware compaction was intended, but the metadata tracking is in-memory only — exactly the gap a MANIFEST would fill.