btree.py) is insufficient for productionDate: 2026-05-29
Time: 07:48
btree.pyLook at PageManager.free_page (lines 89–94 of btree.py):
def free_page(self, page_num):
root, height, total_keys, next_free, free_head = self.read_meta()
data = struct.pack(HEADER_FMT, 0, 0) + struct.pack('>I', free_head)
self.write_page(page_num, data)
self.write_meta(root, height, total_keys, next_free, page_num)
This does two things in sequence:
1. Overwrites the page with a free-list pointer
2. Updates the metadata page to make this page the new free-list head
If the process crashes between steps 1 and 2, the page's content is destroyed but it's not on the free list — it becomes a leaked page, permanently unreachable. Crash between 2 and the parent pointer update, and you have a dangling reference: the parent still points to a page that's now on the free list and could be reallocated to hold completely unrelated data.
The WAL (lines 112–153) doesn't help here either. It logs page writes and replays them on recovery, but the WAL has no concept of multi-step structural operations. It replays whatever was written — if only one of the two writes made it to the WAL before the crash, recovery faithfully reproduces a half-finished deletion.
PostgreSQL's B-tree splits page deletion into two atomic phases with a recoverable intermediate state:
The leaf page is marked half-dead in a single WAL-logged operation. The page's content is preserved. The parent's downlink to this page is removed atomically with the half-dead marking. Crucially, sibling links are not yet updated — other pages still point to this half-dead page.
This is safe because:
A separate operation — often run by VACUUM, not the original deleter — updates the sibling links to skip the half-dead page, then moves the page to the free space map. Only after the sibling pointers are updated is the page truly dead and eligible for reuse.
If we crash between phases, the half-dead flag is a durable breadcrumb. On recovery, PostgreSQL finds half-dead pages and completes their deletion. No leaked pages, no dangling pointers.
The core insight is that page deletion requires three distinct structural changes that can't all fit in one atomic disk write:
| Step | What changes | btree.py | PostgreSQL |
|------|-------------|-----------|------------|
| Remove parent downlink | Parent page | Not done at all in free_page | Phase 1 (WAL-atomic with half-dead flag) |
| Update sibling links | Left and right neighbor pages | Not done (grep for sibling in deletion path returns nothing) | Phase 2 (separate WAL record) |
| Add to free list | Metadata/FSM | free_page line 93–94 | Phase 2 (after siblings updated) |
btree.py jumps straight to the third step — look at the grep results: there are zero matches for merge, rebalance, underflow, redistribute, or sibling in any deletion context. The delete operation (referenced in testdeleteand_reinsert, line 155) removes the key from the leaf but never restructures the tree. Empty leaves persist as wasted space, or are freed without fixing the pointers that reference them.
The grep for lock|latch|concurrent|thread also returns nothing. In PostgreSQL, the two-phase design exists partly because concurrent readers might be traversing sibling links during deletion. But even in a single-threaded system, the crash-safety argument holds: you need a recoverable intermediate state between "page is part of the tree" and "page is on the free list."
The two-phase protocol is an instance of a general pattern: any multi-page structural change needs a crash-recoverable intermediate state. PostgreSQL's half-dead flag is essentially a micro-transaction marker embedded in the page itself. The WAL ensures the flag is durable; the flag tells recovery what operation was in progress; and the second phase is idempotent so it can be safely retried.
btree.py's WAL (lines 112–153) logs individual page writes but has no concept of operation boundaries. There's no beginstructuralchange / endstructuralchange bracketing. The commit method (line 136) just truncates the entire WAL — it's all-or-nothing for the whole transaction, not per-operation. This works for simple puts (write leaf + write metadata), but falls apart for multi-step tree restructuring where you need the WAL to record *intent*, not just *data*.