Topic: How LevelDB's MANIFEST file provides crash-safe version transitions; compare with the WAL crash recovery in lsm.py (lines 88–96) which only covers data, not metadata

Date: 2026-05-29

Time: 13:44

MANIFEST vs. WAL: Metadata Crash Safety

The Gap in This Implementation

The LSM tree in log-structured-merge-tree/lsm.py has a WAL that protects data — unflushed memtable entries survive a crash via WAL.replay() (lines 28–52). But there's no equivalent protection for metadata: which SSTables exist, what sequence numbers are valid, and (in a leveled scheme) what level each file belongs to.

On startup, LSMTree.loadexisting_sstables() (lines 218–228) reconstructs metadata by scanning the directory:


files = sorted(f for f in os.listdir(self._dir) if f.endswith(".sst"))
for fname in files:
    seq = int(fname.split("_")[1].split(".")[0])
    sst = SSTable(os.path.join(self._dir, fname), seq, self._sparse_interval)
    sst.load_index()
    self._sstables.append(sst)

This works for a simple single-level design, but it's fragile. Consider what happens during compaction: you merge several SSTables into one new file, then delete the old ones. If a crash occurs after creating the new file but before deleting the old files, recovery sees both — duplicate data with no way to know which files are authoritative. The inverse (old deleted, new not yet visible) loses data entirely.

What LevelDB's MANIFEST Solves

LevelDB treats the set of live SSTables as a versioned, immutable snapshot called a Version. Every structural change — flush, compaction, file deletion — is recorded as a VersionEdit and appended to the MANIFEST file. This is essentially a WAL for metadata:

1. VersionEdit encodes a delta: "add file X at level 2 with key range [a, m]; remove files Y and Z from level 1."

2. The edit is fsync'd to MANIFEST before any file deletion occurs.

3. On recovery, LevelDB replays all VersionEdits from the MANIFEST to reconstruct the current Version — exactly which files are live and at what levels.

4. A CURRENT file (a single line) points to the active MANIFEST, enabling atomic switchover when the MANIFEST itself is rotated.

This gives crash-safe atomic version transitions: the old set of SSTables is valid until the VersionEdit is durable, at which point the new set is valid. There's never an ambiguous in-between.

Comparing the Two Recovery Paths

| Concern | lsm.py WAL | LevelDB MANIFEST |

|---------|--------------|-----------------|

| What it protects | Memtable entries (key-value data) | SSTable membership, levels, key ranges |

| Recovery method | Replay append log into memtable | Replay VersionEdits to rebuild Version |

| Atomicity unit | Single key write | Entire compaction/flush (add + remove files) |

| Corruption detection | Truncation on partial read (line 37–40) | CRC per record block |

| Metadata recovery | Directory scan (os.listdir) | Deterministic replay |

The standalone WAL module in write-ahead-log/wal.py is more sophisticated — it has CRC checksums (line 36), batch commits with OPCOMMIT markers (lines 143–157), checkpoints (lines 159–166), and rotation (lines 105–114). But even this richer WAL only covers data operations. Neither module logs structural changes like "file sst007.sst is now part of the live set."

Why This Matters

The directory-scan approach in loadexisting_sstables works here because this implementation has no levels — all SSTables are peers searched newest-to-oldest. But the moment you add leveled compaction (as in sstable-and-compaction/sstable.py, which already has a level field in SSTableMetadata at line 39), you need to know which level each file belongs to. That information isn't in the filename or the file contents — it's metadata that lives in the MANIFEST. Without it, a crash during compaction can corrupt the level structure silently.