allocate_page (which updates metadata) and the subsequent WAL-protected page write — can the page be leaked or double-allocated?Date: 2026-05-29
Time: 07:54
allocate_page and WAL-Protected Write: Page Leak AnalysisShort answer: Yes, pages can be leaked, and under specific timing, double-allocated. This is a genuine crash-safety gap in the B-tree storage engine.
allocate_page (btree.py:82–93) modifies metadata directly in the data file, outside of any WAL protection:
def allocate_page(self):
root, height, total_keys, next_free, free_head = self.read_meta()
# ... (free list path omitted) ...
page_num = next_free
self.write_meta(root, height, total_keys, page_num + 1, free_head) # <-- direct write
return page_num
writemeta calls writemeta (btree.py:40–45), which does self.f.flush() but not os.fsync(). Meanwhile, the WAL's log_write (btree.py:131–137) does both flush() and fsync(). This creates two problems:
1. The allocation is not WAL-logged — it's a bare metadata write to the data file.
2. The metadata write isn't even durable — flush() pushes data to the OS page cache, but without fsync(), a power loss or kernel panic can lose it.
1. allocatepage() bumps nextfree from N to N+1 and flushes metadata.
2. Crash occurs before the caller writes page N's content (either directly or via WAL).
3. On recovery: metadata says next_free = N+1 (if the flush reached disk), so page N will never be allocated again. But page N was never written — it contains zeros or garbage. Page N is leaked.
There is no mechanism to detect this: no allocation bitmap, no page-usage cross-reference, no post-recovery consistency check. The leaked page is silently lost forever.
1. allocatepage() bumps nextfree and flushes — but flush() without fsync() means the OS may not have persisted it.
2. The WAL entry for page N's content IS written and fsynced.
3. Crash occurs.
4. On recovery: WAL replays page N's content correctly. But metadata on disk still says next_free = N (the flush was lost). Next allocation returns page N again — overwriting the recovered data with new content.
When reusing pages from the free list (btree.py:86–91), allocatepage pops from the list by updating freehead in metadata. A crash after this metadata write but before the page is used means the page is removed from the free list but never written to — leaked and unreachable.
The B-tree's embedded WAL class (btree.py:113–160) protects page data writes only. Its recover method (btree.py:142–160) replays (pagenum, pagedata) pairs back into the data file, but it has no knowledge of the metadata page's nextfree or freehead fields. Recovery can restore page content but cannot detect or fix a metadata/allocation inconsistency.
The metadata update and the page write need to be atomic — either both happen or neither does. The standard approach: log the allocation itself in the WAL (or include the updated metadata page as a WAL entry), so recovery can either roll forward the complete operation or roll it back.
allocate-page-not-wal-protected — PageManager.allocate_page modifies metadata directly in the data file without logging the allocation to the WAL, making it vulnerable to page leaks on crashwrite-meta-no-fsync — writemeta calls flush() but not os.fsync(), so metadata updates are not durable against power loss or kernel panicwal-recover-ignores-metadata — WAL.recover replays page data writes but does not validate or restore the metadata page's nextfree or freehead fieldsno-leak-detection-mechanism — The B-tree has no allocation bitmap, page-reachability check, or post-recovery consistency scan to detect leaked or orphaned pagesfree-page-non-atomic — free_page writes the free-list link into the page before updating metadata, creating a window where a crash can orphan the page from both the tree and the free list