Topic: What happens if a crash occurs between allocate_page (which updates metadata) and the subsequent WAL-protected page write — can the page be leaked or double-allocated?

Date: 2026-05-29

Time: 07:54

Crash Between allocate_page and WAL-Protected Write: Page Leak Analysis

Short answer: Yes, pages can be leaked, and under specific timing, double-allocated. This is a genuine crash-safety gap in the B-tree storage engine.

The Core Problem

allocate_page (btree.py:82–93) modifies metadata directly in the data file, outside of any WAL protection:


def allocate_page(self):
    root, height, total_keys, next_free, free_head = self.read_meta()
    # ... (free list path omitted) ...
    page_num = next_free
    self.write_meta(root, height, total_keys, page_num + 1, free_head)  # <-- direct write
    return page_num

writemeta calls writemeta (btree.py:40–45), which does self.f.flush() but not os.fsync(). Meanwhile, the WAL's log_write (btree.py:131–137) does both flush() and fsync(). This creates two problems:

1. The allocation is not WAL-logged — it's a bare metadata write to the data file.

2. The metadata write isn't even durableflush() pushes data to the OS page cache, but without fsync(), a power loss or kernel panic can lose it.

Crash Scenario 1: Page Leak (Most Likely)

1. allocatepage() bumps nextfree from N to N+1 and flushes metadata.

2. Crash occurs before the caller writes page N's content (either directly or via WAL).

3. On recovery: metadata says next_free = N+1 (if the flush reached disk), so page N will never be allocated again. But page N was never written — it contains zeros or garbage. Page N is leaked.

There is no mechanism to detect this: no allocation bitmap, no page-usage cross-reference, no post-recovery consistency check. The leaked page is silently lost forever.

Crash Scenario 2: Double-Allocation (Less Likely, More Dangerous)

1. allocatepage() bumps nextfree and flushes — but flush() without fsync() means the OS may not have persisted it.

2. The WAL entry for page N's content IS written and fsynced.

3. Crash occurs.

4. On recovery: WAL replays page N's content correctly. But metadata on disk still says next_free = N (the flush was lost). Next allocation returns page N again — overwriting the recovered data with new content.

The Free-List Path Has the Same Problem

When reusing pages from the free list (btree.py:86–91), allocatepage pops from the list by updating freehead in metadata. A crash after this metadata write but before the page is used means the page is removed from the free list but never written to — leaked and unreachable.

Why the WAL Doesn't Help

The B-tree's embedded WAL class (btree.py:113–160) protects page data writes only. Its recover method (btree.py:142–160) replays (pagenum, pagedata) pairs back into the data file, but it has no knowledge of the metadata page's nextfree or freehead fields. Recovery can restore page content but cannot detect or fix a metadata/allocation inconsistency.

What a Fix Would Require

The metadata update and the page write need to be atomic — either both happen or neither does. The standard approach: log the allocation itself in the WAL (or include the updated metadata page as a WAL entry), so recovery can either roll forward the complete operation or roll it back.

Beliefs