Topic: DDIA Chapter 3 discusses alternative crash-safety strategies; compare this implementation against the textbook models

Date: 2026-05-29

Time: 10:34

DDIA Chapter 3 Crash-Safety Strategies: Textbook vs. This Implementation

DDIA Chapter 3 presents three foundational crash-safety strategies for storage engines: write-ahead logging (WAL), append-only / copy-on-write structures, and checksummed records with recovery. This codebase implements all three — but with varying fidelity to the textbook models. The gaps are instructive: they highlight exactly which details separate a correct description from a correct implementation.

2. Redo-Only vs. Undo vs. ARIES

What DDIA says

Chapter 3 describes WAL-based recovery but doesn't dive deep into the redo/undo distinction. The textbook treatment of ARIES (in database internals literature) identifies three approaches:

What the code does

Every WAL in this codebase is redo-only. The WALRecord dataclass (wal.py:22) stores (seqnum, optype, key, value) where value is the *new* value — there's no field for what was there before. The B-tree WAL stores full page after-images. Neither can reverse an operation.

This is the right choice for these engines. As the entry at topic-redo-vs-undo-logging.md explains: these are either append-only structures (Bitcask, LSM SSTables) where there's nothing to undo in-place, or WAL-protected page stores (B-tree) where uncommitted work is simply discarded by not replaying it. The MVCC database (snapshot-isolation/mvcc_database.py) handles transaction isolation via version chains at a higher level, not via WAL-level undo.

The tradeoff: redo-only recovery time is proportional to the number of records since the last checkpoint. The standalone WAL addresses this with checkpoint records (line ~160) and truncate (line ~170). The B-tree WAL truncates on every commit. The LSM WAL truncates after each flush to SSTable — but as noted above, the truncation-before-SSTable-durable race makes this dangerous.

---

3. Append-Only and Log-Structured Designs

What DDIA says

DDIA presents append-only structures as inherently crash-safe: since data is never overwritten, a crash at worst leaves a partial record at the tail of the file. The log *is* the data structure, and recovery means scanning to find the last complete record.

What the code does

Bitcask (hash-index-storage/bitcask.py) is the purest implementation of this model. The data file is opened in append mode ("ab"), and with syncwrites=True (the default), every write is fsync'd (line ~95-96). Recovery (scandatafile) scans the file sequentially and rebuilds the in-memory hash index. This matches the textbook perfectly.

LSM SSTables are immutable once written — also matching the textbook. But the *creation* of an SSTable is a crash-safety gap. SSTableWriter.finish() in sstable.py calls close() without any preceding fsync(). The SSTable data may still be in the page cache when close() returns. If the WAL is truncated after the flush (as lsm.py:_flush does), a crash can lose both the WAL entries and the SSTable — the data exists nowhere.

Textbook gap: Append-only safety requires *complete* writes

The textbook claim "append-only is crash-safe" assumes the append reached stable storage. Without fsync, an append-only file has the same vulnerability as any other file — data in the page cache is volatile.

---

4. Checksums and Corruption Detection

What DDIA says

Chapter 3 discusses checksums in the context of detecting torn writes (partial page writes due to crashes). LevelDB and RocksDB use per-block CRC32 checksums. B-trees may use page-level checksums.

What the code does

The implementations split into two tiers:

With checksums:

Without checksums:

The break-on-corruption strategy (stop replaying at first bad record) is reasonable for tail corruption from torn writes, but dangerous for mid-file corruption: valid records after the corruption point are silently lost.

---

5. The Directory Fsync Gap

What production systems do

LevelDB, RocksDB, and SQLite all fsync the parent directory after creating new files. This ensures the *directory entry* — not just the file contents — is durable. Without this, a newly created file can vanish on crash even if its contents were fsync'd.

What the code does

No module in this codebase fsyncs a directory file descriptor. The entry at topic-directory-fsync-gap.md documents this comprehensively. The pattern os.open(dir, os.ORDONLY) + os.fsync(dirfd) appears nowhere. This affects:

The most dangerous compound failure: SSTable created (without directory fsync) → WAL truncated → crash → both gone.

---

6. Platform-Specific Durability (macOS)

The entire codebase uses os.fsync(), which on macOS/APFS may not flush the disk write cache. True durability on macOS requires fcntl(fd, F_FULLFSYNC), which is never used. On this platform (Darwin 24.4.0), every fsync() call in the codebase provides weaker guarantees than the code implies. Additionally, os.fdatasync() is unavailable on macOS, so the optimization discussed in topic-fdatasync-vs-fsync-optimization.md would need a platform fallback.

---

Summary: Implementation vs. Textbook

| Strategy | DDIA textbook | This codebase | Gap |

|----------|--------------|---------------|-----|

| WAL durability | fsync before ack | 2 of 3 WALs fsync; LSM WAL does not | LSM WAL is not crash-safe |

| WAL → data ordering | WAL durable before data written | B-tree does this correctly; LSM does not | LSM can lose both WAL and SSTable |

| Redo-only logging | Appropriate for append-only engines | Used consistently | Correct choice |

| Checksums | Per-block CRC | WALs have CRC; SSTables and Bitcask do not | Silent corruption possible in SSTables |

| Directory fsync | Required after file creation | Never done | New files can vanish on crash |

| Group commit | Documented tradeoff | Standalone WAL batch mode | Correctly implemented |

| Append-only safety | Crash leaves partial tail record | Bitcask correct; SSTables not fsync'd | SSTable creation has durability gap |

The B-tree storage engine and standalone WAL are the closest to textbook-correct. The LSM tree has the most significant gaps — its WAL provides no crash durability, and the flush-to-SSTable path can lose data in both the WAL and the SSTable simultaneously. The hash-index Bitcask is correct when sync_writes=True but has no checksums.

---

Topics to Explore

Beliefs