Date: 2026-05-29
Time: 06:26
The stop-at-first-corruption strategy is implemented in readrecord (wal.py:37-58). It has two exit paths:
1. EOF or partial read → returns None (benign, normal end-of-file)
2. CRC mismatch → raises ValueError (corruption detected)
Every caller handles this the same way — break out of the read loop:
# _recover_seq_num, wal.py:85-96
for path in self._wal_files():
with open(path, "rb") as f:
while True:
try:
rec = _read_record(f)
if rec is None:
break
max_seq = max(max_seq, rec.seq_num)
except ValueError:
break # ← stops reading THIS file
Here's the critical detail: the behavior differs between same-file and cross-file corruption.
In recoverseqnum, the break exits the inner while True loop, but the outer for path in self.wal_files() loop continues. So if file 2 of 4 has corruption mid-way:
| File | Records | What happens |
|------|---------|-------------|
| 000001.wal | seq 1–50 | All read ✓ |
| 000002.wal | seq 51–100, corruption at seq 75 | Reads 51–74, stops. Seq 75–100 lost |
| 000003.wal | seq 101–150 | All read ✓ |
| 000004.wal | seq 151–200 | All read ✓ |
The recovered max_seq would be 200 (correct), but replay (whose source is past line 200, not included in the observations) likely follows the same pattern: it would silently drop records 75–100 from file 2 while still replaying files 3 and 4.
Records 75–100 in the example above are valid, intact data. They're dropped solely because an earlier record in the same file had a bad CRC. The caller has no way to know these records existed. The testcorruption test (testwal.py:45-57) only validates single-file corruption where the damaged record is at the tail — it never tests mid-file corruption with valid records following.
appendbatch (wal.py:153-168) writes all operations plus a COMMIT record as one buffer, but mayberotate (wal.py:132-134) runs *after* the write. If a batch is the last thing written before a crash and the file was near maxfilesize, the batch could straddle a rotation boundary if the implementation were slightly different. In the current code, rotation happens post-write so a single batch stays in one file — but this is a fragile invariant that depends on maybe_rotate never being called mid-batch.
If corruption causes records 75–100 to be dropped from file 2, but files 3 and 4 replay fine, the consumer sees a gap from seq 74 to seq 101. Nothing in the current code flags this gap. A consumer that assumes monotonically increasing sequence numbers *without gaps* would silently accept an incomplete log.
For tail corruption (torn writes at crash time): yes. This is the common case. A crash mid-write produces a partial record at the end of the last file. readrecord returns None for short reads, and no valid data follows.
For mid-file corruption (bit rot, media errors): no. Valid records after the corruption point are silently discarded. In a multi-file WAL, this creates a hole in the sequence — data before and after the corrupted file segment is recovered, but the gap is never reported.
Production WALs (PostgreSQL, SQLite, RocksDB) handle this differently: they either use per-record framing that allows scanning past corruption, or they treat any corruption as a hard error requiring operator intervention. This implementation takes the simpler path, which is appropriate for a teaching codebase but would need hardening for production use.
write-ahead-log/wal.py:replay — The replay method (past line 200) is the primary consumer of recovery; verify whether it also continues across files or stops globally at first corruptionwrite-ahead-log/wal.py:truncate — Truncation also uses the break-on-corruption pattern (visible at line 199); if corruption precedes the truncation point, records that should be kept may be silently droppedlog-structured-merge-tree/lsm.py:WAL — The LSM tree has its own simpler WAL class (line 13); compare its corruption handling to the standalone WAL modulebatch-atomicity-across-rotation — Explore what happens if appendbatch writes a buffer that pushes past maxfile_size — does the batch stay atomic or can it split?gap-detection-during-replay — Whether any consumer of replay validates sequence number continuity, which would surface silent data loss from mid-file corruptionwal-corruption-stops-per-file — recoverseq_num stops at the first corrupted record within each file but continues to subsequent WAL files, so cross-file recovery is resilient but intra-file records after corruption are lostwal-no-gap-detection — Neither the WAL reader nor any visible consumer checks for sequence number gaps after replay, making silent data loss from mid-file corruption undetectablewal-batch-single-file — appendbatch writes the entire batch buffer in one fd.write call before mayberotate, so a batch never spans two WAL files under normal operationwal-tail-corruption-safe — Tail corruption (partial writes from crashes) is safely handled because readrecord returns None on short reads, and no valid records can follow a torn write at EOF