Date: 2026-05-29
Time: 10:34
DDIA Chapter 3 presents three foundational crash-safety strategies for storage engines: write-ahead logging (WAL), append-only / copy-on-write structures, and checksummed records with recovery. This codebase implements all three — but with varying fidelity to the textbook models. The gaps are instructive: they highlight exactly which details separate a correct description from a correct implementation.
Chapter 3 describes WAL-based recovery but doesn't dive deep into the redo/undo distinction. The textbook treatment of ARIES (in database internals literature) identifies three approaches:
Every WAL in this codebase is redo-only. The WALRecord dataclass (wal.py:22) stores (seqnum, optype, key, value) where value is the *new* value — there's no field for what was there before. The B-tree WAL stores full page after-images. Neither can reverse an operation.
This is the right choice for these engines. As the entry at topic-redo-vs-undo-logging.md explains: these are either append-only structures (Bitcask, LSM SSTables) where there's nothing to undo in-place, or WAL-protected page stores (B-tree) where uncommitted work is simply discarded by not replaying it. The MVCC database (snapshot-isolation/mvcc_database.py) handles transaction isolation via version chains at a higher level, not via WAL-level undo.
The tradeoff: redo-only recovery time is proportional to the number of records since the last checkpoint. The standalone WAL addresses this with checkpoint records (line ~160) and truncate (line ~170). The B-tree WAL truncates on every commit. The LSM WAL truncates after each flush to SSTable — but as noted above, the truncation-before-SSTable-durable race makes this dangerous.
---
DDIA presents append-only structures as inherently crash-safe: since data is never overwritten, a crash at worst leaves a partial record at the tail of the file. The log *is* the data structure, and recovery means scanning to find the last complete record.
Bitcask (hash-index-storage/bitcask.py) is the purest implementation of this model. The data file is opened in append mode ("ab"), and with syncwrites=True (the default), every write is fsync'd (line ~95-96). Recovery (scandatafile) scans the file sequentially and rebuilds the in-memory hash index. This matches the textbook perfectly.
LSM SSTables are immutable once written — also matching the textbook. But the *creation* of an SSTable is a crash-safety gap. SSTableWriter.finish() in sstable.py calls close() without any preceding fsync(). The SSTable data may still be in the page cache when close() returns. If the WAL is truncated after the flush (as lsm.py:_flush does), a crash can lose both the WAL entries and the SSTable — the data exists nowhere.
The textbook claim "append-only is crash-safe" assumes the append reached stable storage. Without fsync, an append-only file has the same vulnerability as any other file — data in the page cache is volatile.
---
Chapter 3 discusses checksums in the context of detecting torn writes (partial page writes due to crashes). LevelDB and RocksDB use per-block CRC32 checksums. B-trees may use page-level checksums.
The implementations split into two tiers:
With checksums:
wal.py): CRC32 per record (line ~22). Recovery rejects entries where the checksum doesn't match, stopping at the first corruption.btree.py): CRC32 per page image (line ~170 in recover). Same break-on-corruption strategy.Without checksums:
lsm.py): No checksums at all. Recovery relies on length-prefix framing and stops at the first short read.topic-sstable-footer-integrity.md).The break-on-corruption strategy (stop replaying at first bad record) is reasonable for tail corruption from torn writes, but dangerous for mid-file corruption: valid records after the corruption point are silently lost.
---
LevelDB, RocksDB, and SQLite all fsync the parent directory after creating new files. This ensures the *directory entry* — not just the file contents — is durable. Without this, a newly created file can vanish on crash even if its contents were fsync'd.
No module in this codebase fsyncs a directory file descriptor. The entry at topic-directory-fsync-gap.md documents this comprehensively. The pattern os.open(dir, os.ORDONLY) + os.fsync(dirfd) appears nowhere. This affects:
wal.py:_rotate): New WAL segment can vanishThe most dangerous compound failure: SSTable created (without directory fsync) → WAL truncated → crash → both gone.
---
The entire codebase uses os.fsync(), which on macOS/APFS may not flush the disk write cache. True durability on macOS requires fcntl(fd, F_FULLFSYNC), which is never used. On this platform (Darwin 24.4.0), every fsync() call in the codebase provides weaker guarantees than the code implies. Additionally, os.fdatasync() is unavailable on macOS, so the optimization discussed in topic-fdatasync-vs-fsync-optimization.md would need a platform fallback.
---
| Strategy | DDIA textbook | This codebase | Gap |
|----------|--------------|---------------|-----|
| WAL durability | fsync before ack | 2 of 3 WALs fsync; LSM WAL does not | LSM WAL is not crash-safe |
| WAL → data ordering | WAL durable before data written | B-tree does this correctly; LSM does not | LSM can lose both WAL and SSTable |
| Redo-only logging | Appropriate for append-only engines | Used consistently | Correct choice |
| Checksums | Per-block CRC | WALs have CRC; SSTables and Bitcask do not | Silent corruption possible in SSTables |
| Directory fsync | Required after file creation | Never done | New files can vanish on crash |
| Group commit | Documented tradeoff | Standalone WAL batch mode | Correctly implemented |
| Append-only safety | Crash leaves partial tail record | Bitcask correct; SSTables not fsync'd | SSTable creation has durability gap |
The B-tree storage engine and standalone WAL are the closest to textbook-correct. The LSM tree has the most significant gaps — its WAL provides no crash durability, and the flush-to-SSTable path can lose data in both the WAL and the SSTable simultaneously. The hash-index Bitcask is correct when sync_writes=True but has no checksums.
---
b-tree-storage-engine/btree.py:commit — The textbook-correct WAL commit fence: fsync data file, then truncate WAL. Compare with the LSM tree's _flush to see the ordering violation.log-structured-merge-tree/lsm.py:_flush — The most dangerous code path: WAL truncation before SSTable is durable. Trace exactly when data can exist in neither location.aries-vs-redo-only-recovery — This codebase uses redo-only everywhere; explore what ARIES-style redo+undo would add (concurrent transaction abort, fine-grained page LSNs) and why it's overkill here.sstable-and-compaction/sstable.py — No fsync, no checksums, unprotected footer — compare with LevelDB's table format (per-block CRC, magic number, metaindex block) to see the production hardening this teaching implementation omits.write-to-temp-then-rename-pattern — Production databases use atomic rename (write-temp → fsync → rename → fsync-dir) for safe file creation. None of these implementations use this pattern.btree-wal-is-textbook-correct — The B-tree WAL follows the DDIA textbook protocol: fsync WAL entry before data write, fsync data file before WAL truncation, with CRC32 checksums for torn-write detectionlsm-wal-violates-write-ahead-invariant — The LSM WAL calls flush() without fsync() at lsm.py:26, violating the fundamental WAL invariant that log entries must be durable before the operation is acknowledgedlsm-flush-has-dual-loss-window — The LSM _flush path can lose data in both the WAL (truncated) and SSTable (not fsync'd) simultaneously, because neither is made durable before the other is destroyedno-directory-fsync-in-codebase — No implementation fsyncs parent directories after file creation, leaving all file-creating operations (WAL rotation, SSTable flush, compaction output) vulnerable to the directory-entry-loss failure moderedo-only-is-correct-for-append-only-engines — All three WAL implementations use redo-only logging (after-images only), which is the correct choice for append-only and log-structured engines where in-place overwrites don't occur