Date: 2026-05-29
Time: 10:38
This implementation is a B+-tree, not a plain B-tree. The distinction is visible right in the on-disk format. Look at serializeleaf at b-tree-storage-engine/btree.py:178:
def _serialize_leaf(keys, values, next_sibling=NO_SIBLING):
buf = struct.pack(HEADER_FMT, LEAF, len(keys))
buf += struct.pack('>I', next_sibling)
for k, v in zip(keys, values):
...
Every leaf node carries two things a plain B-tree internal node wouldn't: the actual values (not just routing keys) and a nextsibling pointer (a 4-byte page number). The sentinel NOSIBLING = 0xFFFFFFFF at line 14 marks the end of the chain.
Internal nodes, by contrast, store only keys and child page pointers — no values. This is the defining split between B-trees and B+-trees: values live exclusively in leaves.
In a plain B-tree, if you want to scan keys "cherry" through "grape", you'd do an in-order traversal of the tree — descending into a subtree, visiting a key in an internal node, descending into the next subtree, back up, and so on. Every key visit potentially requires navigating up and down the tree. For a range scan of *N* results in a tree of height *h*, you might touch *O(N × h)* pages.
The B+-tree leaf chain converts this into a linked-list walk. Once you've descended to the first matching leaf (costing *h* page reads), you follow next_sibling pointers horizontally. Each subsequent leaf is one page read. For *N* results spread across *L* leaves, the cost is *h + L* page reads — and the *L* reads are sequential on disk if leaves were allocated in order.
The test at b-tree-storage-engine/test_btree.py:82 exercises this directly:
results = tree.range_scan("k05", "k10")
keys = [k for k, v in results]
assert keys == ["k05", "k06", "k07", "k08", "k09"]
This scan spans multiple leaf pages (with maxkeysperpage=4, 20 keys require at least 5 leaves). The rangescan method descends to "k05", then walks the sibling chain until it hits "k10" — no re-traversal of internal nodes.
The binary layout makes the chain explicit. From writeempty_leaf at line 52:
def _write_empty_leaf(self, page_num):
data = struct.pack(HEADER_FMT, LEAF, 0)
data += struct.pack('>I', NO_SIBLING) # next sibling
Each leaf page is structured as:
| Field | Size | Purpose |
|-------|------|---------|
| Page type | 1B | LEAF=1 vs INTERNAL=0 |
| Key count | 2H | Number of entries |
| Next sibling | 4B | Page number of right neighbor |
| Key-value pairs | variable | Length-prefixed keys and values |
The nextsibling field sits right after the header — it's the first thing read after determining this is a leaf. deserializeleaf (line 189) returns it as the third element of the tuple: (keys, values, nextsibling).
What the chain buys you:
WHERE key BETWEEN x AND y.testbasic (line 33: allkeys = [k for k, v in tree]) walks the entire leaf chain without touching a single internal node after the initial descent.What it costs:
next_sibling pointer — trivial overhead per page.nextsibling, and the old leaf's nextsibling is rewritten to point to the new leaf. This is one extra page write per split.testpageio_counts test at line 99 validates this: reads should equal height + 1 at most.The DDIA insight: For workloads dominated by range queries (which most OLTP databases are), the leaf chain's sequential scan advantage far outweighs the extra descent cost on point lookups. This is why virtually every production database uses B+-trees, not B-trees.
The observations only include the first 200 lines of btree.py (612 total). To fully trace the leaf chain mechanics, you'd want to read:
rangescan method — to see how it follows nextsibling pointers_iter_ method — to see full-tree iteration via the leaf chainserializeinternal / deserializeinternal — to confirm internal nodes store only keys + child pointers, no valuesb-tree-storage-engine/btree.py:range_scan — How the leaf sibling chain is actually walked during range queries; this is where the O(h + L) cost manifestsb-tree-storage-engine/btree.py:serializeinternal — Confirm internal nodes store only routing keys and child pointers, completing the B+-tree pictureleaf-split-chain-maintenance — When a leaf overflows and splits, how the sibling pointers are rewired; a bug here silently breaks range scansb-tree-storage-engine/btree.py:_iter_ — Full-table scan implementation; likely descends to the leftmost leaf then walks the chain to the endsequential-io-and-page-allocation — Whether allocate_page (line 93) tends to produce physically contiguous leaves, which determines whether the logical chain translates to actual sequential disk I/Oleaf-nodes-store-all-values — Values are stored exclusively in leaf nodes; internal nodes contain only routing keys and child page pointers, making this a B+-treeleaf-sibling-chain-exists — Every leaf carries a 4-byte next_sibling page pointer (sentinel 0xFFFFFFFF for the rightmost leaf), forming a singly-linked chain across all leavesrange-scan-uses-leaf-chain — range_scan descends to the start key's leaf then follows sibling pointers horizontally, never re-traversing internal nodespoint-lookup-always-reaches-leaf — Every get must descend to a leaf node regardless of tree height, since internal nodes hold no values; cost is exactly *h* page reads for a tree of height *h*