Date: 2026-05-29
Time: 10:41
When a leaf overflows during insertion, the B-tree must perform three coordinated writes:
1. New right page — contains the upper half of keys, inherits the old next_sibling pointer
2. Modified left page — retains the lower half, its next_sibling now points to the new right page
3. Modified parent — gains a separator key and child pointer to the new right page
If the process crashes after only some of these writes reach disk, the leaf sibling chain used by rangescan (b-tree-storage-engine/btree.py) could be broken. A rangescan walks the chain via next_sibling pointers — if a pointer leads to an uninitialized or corrupt page, or if the chain skips the new right page entirely, keys silently disappear from range results.
The crash-safety mechanism has three layers:
walwrite_pageEvery page mutation flows through walwritepage (documented in the walwritepage entry), which enforces a strict protocol:
WAL log entry (with fsync) → data file write (flush only, no fsync)
Each WAL entry is individually fsynced to disk before the data file is touched. This means the WAL always contains at least as much information as the data file. If the process dies at any point, the WAL has a durable record of every intended write up to that point.
putThe put method (documented in the put entry) structures a complete mutation as a single WAL transaction:
_insert(root, key, value, height) # WAL-logs all page writes (leaf, split pages, parent updates)
└─ _wal_write_page(right_page) # ① new right leaf (fsync'd to WAL)
└─ _wal_write_page(left_page) # ② modified left leaf (fsync'd to WAL)
└─ _wal_write_page(parent) # ③ updated parent node (fsync'd to WAL)
_wal_write_meta(...) # ④ metadata (new root, key count, etc.)
wal.commit(pm) # ⑤ fsync data file, then truncate WAL
The critical property: insert never calls wal.commit() (documented in the insert entry as btree-insert-no-wal-commit). All page writes accumulate in the WAL, and only put calls commit() after everything is written. The commit sequence in WAL.commit (documented in the commit entry) is:
1. page_manager.sync() — fsync the data file
2. Truncate the WAL to zero
3. fsync the WAL file
This ordering ensures that the data file is durable before the WAL is cleared. If a crash occurs during commit, either the WAL still exists (and recovery replays it) or the data file is already consistent.
WAL.recoverOn every startup, BTree._init_ unconditionally calls WAL.recover() (documented in the recover entry). Recovery reads the entire WAL, verifies each entry's CRC32 checksum, and replays valid entries by overwriting the corresponding pages in the data file. Then it fsyncs the data file and truncates the WAL.
The replay is idempotent — writing the same page image twice produces the same result. Entries with failed checksums (torn writes from a crash mid-write) are silently skipped.
The order of writes within _insert during a leaf split is significant:
1. The right page is written first — it contains the upper half of keys and inherits the old next_sibling value
2. The left page is written second — its next_sibling is updated to point to the new right page
Because each walwrite_page call fsyncs the WAL entry before proceeding, there are only three possible states after a crash:
| Crash point | WAL contains | Recovery result | Sibling chain |
|---|---|---|---|
| Before right page WAL entry completes | Nothing (or torn entry, skipped by checksum) | Pre-split state | Intact — old chain unchanged |
| After right page logged, before left page | Right page only | Right page exists but left page still has old sibling pointer | Intact — left still points to old next, right page is orphaned but unreachable |
| After both logged, before commit | Both pages | Both replayed — left points to right, right points to old next | Correct post-split chain |
The key insight: the sibling chain never enters a state where it points to a nonexistent or corrupt page. The left page's sibling pointer is only updated after the right page (the target of that pointer) is already durable in the WAL.
The entries document two real gaps in the crash-atomicity story:
1. No commit marker in the WAL — Unlike the standalone WriteAheadLog in write-ahead-log/wal.py (which has explicit OPCOMMIT records and appendbatch for atomic groups), the B-tree's WAL class has no transaction boundaries. recover() replays ALL valid entries unconditionally. This means a crash mid-split replays a prefix of the intended writes — not all-or-nothing. The sibling chain itself is safe (due to write ordering), but the tree structure can be inconsistent: if the parent update wasn't logged, the right page is reachable via sibling chain (range scans work) but unreachable via tree descent (point lookups for keys in the right page fail silently).
2. Page allocation outside WAL coverage — The insert entry documents this as btree-allocate-page-outside-wal: pm.allocatepage() modifies metadata (the nextfree counter or free list) through PageManager.writemeta, which is not WAL-logged. If a crash occurs after allocation but before the split completes and metadata is re-written via walwrite_meta, the metadata page may be inconsistent — a page was consumed from the allocator but never populated with valid data.
The test testwalrecovery in b-tree-storage-engine/testbtree.py (lines 107-122) validates basic crash recovery by closing file handles without calling close() (simulating a crash), then reopening. However, it doesn't test the mid-split crash scenario — it doesn't verify that a crash between writing the right page and the left page leaves the sibling chain intact. A more targeted test would need to inject a crash between specific walwritepage calls.