Date: 2026-05-29
Time: 11:44
WAL.replay — Write-Ahead Log Recoveryreplay reconstructs the in-memory state of uncommitted writes after a crash or restart. It reads the WAL file from disk and deserializes every entry that was appended but not yet flushed to an SSTable. Without this, any writes that reached the WAL but hadn't triggered a memtable flush would be silently lost.
This is the recovery half of the WAL contract: append guarantees durability (write-before-acknowledge), and replay guarantees that durable data is restored into the memtable on startup.
self.path is set (always true after init_). The file may or may not exist.None beyond self. The WAL path was fixed at construction time.
List[Tuple[str, bytes]] — a list of (key, valuebytes) pairs. The caller (LSMTree.replaywal) inserts each pair directly into the active memtable via self.memtable[key] = value. The caller does not need to handle errors — an empty list means either no WAL exists or it was already truncated.
Values are raw bytes, not decoded strings. This includes tombstones (b""), which the caller must treat as deletions. Replay does not distinguish puts from deletes — it just restores whatever was written.
1. Bail if no file: If the WAL path doesn't exist on disk, return an empty list. This handles first-run or post-truncation states.
2. Bulk read: Slurp the entire file into memory as a single bytes object. This is a deliberate choice — the WAL is bounded by the memtable threshold (default 1000 entries), so it's small enough to fit in RAM.
3. Sequential deserialization loop: Walk through the byte buffer with a manual cursor (pos), decoding the same TLV (type-length-value) wire format that append writes:
`
[4B keylen][keybytes][4B valuelen][valuebytes]
`
Each field is big-endian unsigned 32-bit (>I format in struct).
4. Boundary checks at every step: Before reading each length prefix or payload, the method checks whether enough bytes remain. If not, it breaks out of the loop — treating the remainder as a torn write. This is the crash-tolerance mechanism: if the process died mid-append, the partial trailing entry is simply ignored.
5. Accumulate and return: Complete entries are appended to the list and returned in order.
self._fd used by append).The method is intentionally lenient:
decode("utf-8") will raise UnicodeDecodeError. This is unguarded — the assumption is that all keys were written by append, which encodes from valid Python strings.Called exactly once, during LSMTree._init, via replay_wal:
def _replay_wal(self):
for key, value in self._wal.replay():
self._memtable[key] = value
This replays writes into the fresh memtable before the tree accepts any new operations. The replay-then-serve sequence is the standard recovery protocol for WAL-backed storage engines.
After a successful flush-to-SSTable, WAL.truncate() empties the file, so replay on the *next* startup won't re-apply already-persisted entries. The correctness of this relies on the flush happening atomically before the truncate — if the process crashes between flush and truncate, replay will re-insert entries that are already in an SSTable, which is safe because the memtable values shadow SSTable values during reads (newer wins).
os.path.exists — file existence checkstruct.unpack — big-endian integer deserialization (mirrors the struct.pack in append)WAL.append — the two methods must agree on >I length prefixes and the [klen][key][vlen][value] layoutlog-structured-merge-tree/lsm.py:WAL.append — The write side of the WAL protocol; understanding the wire format makes replay's parsing logic obviouslog-structured-merge-tree/lsm.py:LSMTree._flush — The flush-then-truncate sequence that determines when WAL entries become redundantwal-torn-write-recovery — How production WALs (RocksDB, LevelDB) use checksums and record boundaries to distinguish torn writes from corruption, which this implementation skipslog-structured-merge-tree/lsm.py:SSTable.readentry — Uses the same TLV wire format but reads from a file handle instead of a byte buffer; compare the two parsing approachesreplay-idempotency — Why re-applying WAL entries after a crash-between-flush-and-truncate is safe in an LSM tree (last-writer-wins semantics)wal-replay-tolerates-partial-writes — WAL.replay silently discards any trailing partial entry, treating it as a torn write from a crash; it never raises on truncated datawal-no-checksum-verification — The WAL has no CRC or integrity check; a corrupt-but-length-valid entry will be returned as data without detectionwal-replay-order-matches-append-order — Entries are returned in the same sequential order they were appended, which preserves last-write-wins semantics when replayed into a memtablewal-replay-is-read-only — replay does not modify the WAL file or any in-memory state; the caller (replaywal) is responsible for inserting entries into the memtablewal-flush-truncate-replay-safety — If a crash occurs between SSTable flush and WAL truncation, replay re-inserts already-persisted entries, which is safe because memtable values shadow SSTable values during reads