Date: 2026-05-29
Time: 07:31
Every log reader in this repo follows the same pattern: scan forward sequentially, stop the moment anything looks wrong. There are no attempts to skip past corruption and recover later records.
The WAL implementation in write-ahead-log/wal.py:85-95 shows this most explicitly:
# _recover_seq_num
while True:
try:
rec = _read_record(f)
if rec is None: # EOF or short read
break
max_seq = max(max_seq, rec.seq_num)
except ValueError: # CRC mismatch
break # ← abandon the rest of this file
The readrecord function (wal.py:38-58) uses length-prefixed framing — it reads a 4-byte record length, then reads exactly that many bytes — but if the CRC fails, it raises ValueError, and every caller responds by breaking out of the scan loop entirely.
The Bitcask implementation in log-structured-hash-table/bitcask.py:89-105 does the same thing without the exception:
# _scan_segment
if crc != expected_crc:
break # corrupted record at end, stop scanning
The LSM tree's WAL (log-structured-merge-tree/lsm.py:31-52) is even more bare — no CRC at all, just length-prefixed fields with short-read guards. A single corrupted length field would cause it to read garbage offsets, and there's no way to recover.
The "stop at first error" strategy is correct for tail corruption — the common case where a crash interrupts a write in progress, leaving a partial record at the end of the file. Everything before it is intact; everything after it doesn't exist. The test in test_wal.py:44-58 validates exactly this: corrupt the last 5 bytes, recover, get back only the first record.
But this strategy silently discards valid records that follow a mid-file corruption. If a disk sector goes bad in the middle of the file, or a cosmic ray flips a bit in an earlier record, every record after the corruption is lost — even though they're perfectly valid and fully written.
LevelDB's log format divides the file into fixed-size 32KB blocks. Each record within a block carries its own header: checksum (4B) | length (2B) | type (1B). The type field handles records that span block boundaries (FIRST, MIDDLE, LAST, FULL).
The key insight: if corruption is found, skip to the next 32KB block boundary and resume scanning. You lose at most ~32KB of data instead of the entire tail.
Compared to this codebase's WAL format (wal.py:28-34):
record_length field. If that 4-byte length is corrupted, the reader jumps to a garbage offset and can never resync.The tradeoff: block alignment wastes space at block boundaries (padding bytes when a record doesn't fit the remaining space), and records that span blocks need the FIRST/MIDDLE/LAST machinery. The WAL here packs records tightly with zero waste.
Some formats prepend a magic number (e.g., 0xCAFEBABE) to every record. On corruption, you scan forward byte-by-byte looking for the next magic sequence, then try to parse from there.
None of the implementations here use magic bytes. The Bitcask format (log-structured-hash-table/bitcask.py:12-13) goes straight into crc32 | keysize | valuesize — no sentinel. The WAL format (wal.py:33) starts with recordlength | crc | seqnum | optype | keylen — also no sentinel.
Magic bytes would let these readers resync after mid-file corruption, but with costs:
For these reference implementations, the design is pragmatic:
1. WAL files are short-lived. They're truncated after checkpoint (wal.py:176-200), and the LSM WAL is truncated after flush (lsm.py:56-60). Mid-file corruption in a file that exists for seconds to minutes is unlikely.
2. Segment rotation limits blast radius. Both Bitcask stores rotate segments at a size threshold (bitcask.py:118, hash-index-storage/bitcask.py:101). Losing the tail of one segment doesn't affect other segments.
3. Simplicity matters for correctness. Block-aligned formats add ~100 lines of boundary-management code. Each line is a place for bugs. For a teaching codebase demonstrating DDIA concepts, the simpler approach is easier to audit.
The real cost shows up in production-grade systems where WAL files can grow large before checkpointing, or where storage media failures cause mid-file corruption. That's where LevelDB's block format earns its complexity.
leveldb-log-format — Read the LevelDB logformat.md and logreader.cc to see how block-aligned recovery handles the FIRST/MIDDLE/LAST record spanning and the kBadRecord skip logicwrite-ahead-log/wal.py:encoderecord — Trace how the record_length prefix creates a chain of trust where one corrupted length field breaks all subsequent parsinglog-structured-hash-table/bitcask.py:scansegment — Compare the CRC-based stop-at-error here with the LSM WAL's no-CRC approach in lsm.py:replay and consider what each loses on corruptionwrite-ahead-log/test_wal.py — The corruption test only covers tail corruption (last 5 bytes); try inserting corruption mid-file to observe how many valid trailing records are silently droppedsstable-footer-integrity — The SSTable format in lsm.py stores its sparse index as a footer; explore what happens if the footer's 12-byte trailer is corrupted vs. a mid-file data entrywal-stop-at-first-error — All WAL and segment readers in this codebase break out of the scan loop on the first CRC failure or short read, discarding all subsequent records regardless of their validitywal-no-resync-capability — No implementation uses magic bytes, block alignment, or any other mechanism that would allow resuming parsing after a corrupted record in the middle of a filelsm-wal-no-crc — The LSM tree's WAL (lsm.py:14-53) uses only length-prefixed framing with no checksum, making it unable to distinguish corruption from valid data unless a length field causes an out-of-bounds readsegment-rotation-limits-corruption-blast-radius — Both Bitcask implementations rotate to new segment files at a configurable size threshold, bounding the maximum data loss from stop-at-first-error to the records trailing the corruption within a single segment