File: b-tree-storage-engine/btree.py

Date: 2026-05-28

Time: 18:10

b-tree-storage-engine/btree.py

Purpose

This file is a self-contained, disk-backed B-tree storage engine — the kind described in DDIA Chapter 3 as the dominant indexing structure in relational databases. It owns the full stack: page-level I/O, binary serialization of leaf and internal nodes, key lookup/insert/delete with node splitting, range scans via leaf sibling pointers, and write-ahead logging for crash safety. It's a teaching implementation: correct and complete enough to demonstrate the real concepts, but not production-hardened (no concurrency, no buffer pool, no overflow pages for large values).

Key Components

Constants & Wire Formats

PageManager

Low-level fixed-size page I/O. Owns the data file handle and tracks read/write counts for observability.

WAL (Write-Ahead Log)

Crash recovery via a sequential log file. Each entry is (seq, pagenum, datalen, data, crc32).

Serialization Functions

Four free functions handle the binary format of leaf and internal pages:

BTree

The main API. Wraps PageManager + WAL and exposes a key-value interface.

Core operations:

Utilities: _contains, len, iter, minkey, maxkey, stats, printtree.

TreeStats

A simple data class holding tree dimensions and I/O counters — useful for testing that operations touch the expected number of pages.

Patterns

WAL-then-write protocol. Every page mutation follows the sequence: log to WAL (with fsync) → write to data file → eventually commit (fsync data, truncate WAL). The walwritepage and walwritemeta methods enforce this. This is the standard crash-recovery pattern from DDIA Chapter 3.

Recursive insert with split propagation. _insert returns a discriminated union (None | 'inserted' | tuple) that bubbles up through the call stack. Each level handles its own split locally. The root-split case is handled at the top level in put(), not inside the recursion — this keeps the recursion clean.

Free list as an intrusive linked list. Freed pages store the "next free" pointer in their own body (at offset HEADER_SIZE). No separate data structure needed.

Leaf sibling chain. Leaves store a next_sibling pointer enabling efficient range scans without touching internal nodes. During splits, the new right leaf inherits the old left leaf's sibling pointer, and the left leaf is updated to point to the new right leaf.

Counter reset per operation. pm.resetcounters() is called at the start of each public method, so stats.pagesread/written reflects the most recent operation only.

Dependencies

Imports: Only stdlib — os (file I/O), struct (binary packing), zlib (CRC32 checksums), bisect (sorted-array search). Zero external dependencies.

Imported by: The two test files (testbtree.py, testertest_btree.py) exercise the BTree API. Nothing else in the repo depends on this module.

Flow

Write path (put)

1. Reset I/O counters.

2. Read metadata (root page, height, total key count).

3. Call _insert(root, key, value, height) recursively:

4. If the root itself split, allocate a new root with one key and two children, increment height.

5. Update metadata with new total key count.

6. Commit WAL (fsync data file, truncate WAL).

Read path (get)

1. Read metadata for root and height.

2. Walk from root to leaf:

3. Return the value bytes or None.

Recovery path (on BTree._init_)

1. Open data file and WAL file.

2. wal.recover(pm): replay all valid WAL entries (checksum-verified) to the data file, fsync, truncate WAL.

3. Tree is now in a consistent state.

Invariants

1. Page 0 is always metadata. The root page is never page 0.

2. Keys within every node are sorted. All lookup, insert, and range-scan logic depends on this.

3. Internal nodes with n keys have exactly n+1 children. The serialization format encodes child_0 first, then interleaves (key, child) pairs.

4. Leaf sibling pointers form a left-to-right chain terminated by NO_SIBLING. Splits maintain this: right inherits the old sibling, left points to right.

5. totalkeys in metadata is authoritative. len_ reads it directly; insert/delete increment/decrement it atomically with the WAL commit.

6. Every page write is preceded by a WAL entry. The walwritepage / walwritemeta methods enforce this coupling.

7. WAL commit is the atomic boundary. If the process crashes mid-operation, recovery replays the WAL, producing a state equivalent to "all writes completed." If the WAL is empty/truncated, the data file is already consistent.

Error Handling

Topics to Explore

Beliefs