Topic: DDIA Chapter 3 covers why B+-trees link leaves in a chain (vs. B-trees which store values at every level), and the tradeoffs for sequential scan performance

Date: 2026-05-29

Time: 10:38

B+-Tree Leaf Chains and Sequential Scan Performance

The Core Design Decision

This implementation is a B+-tree, not a plain B-tree. The distinction is visible right in the on-disk format. Look at serializeleaf at b-tree-storage-engine/btree.py:178:


def _serialize_leaf(keys, values, next_sibling=NO_SIBLING):
    buf = struct.pack(HEADER_FMT, LEAF, len(keys))
    buf += struct.pack('>I', next_sibling)
    for k, v in zip(keys, values):
        ...

Every leaf node carries two things a plain B-tree internal node wouldn't: the actual values (not just routing keys) and a nextsibling pointer (a 4-byte page number). The sentinel NOSIBLING = 0xFFFFFFFF at line 14 marks the end of the chain.

Internal nodes, by contrast, store only keys and child page pointers — no values. This is the defining split between B-trees and B+-trees: values live exclusively in leaves.

Why Link Leaves?

In a plain B-tree, if you want to scan keys "cherry" through "grape", you'd do an in-order traversal of the tree — descending into a subtree, visiting a key in an internal node, descending into the next subtree, back up, and so on. Every key visit potentially requires navigating up and down the tree. For a range scan of *N* results in a tree of height *h*, you might touch *O(N × h)* pages.

The B+-tree leaf chain converts this into a linked-list walk. Once you've descended to the first matching leaf (costing *h* page reads), you follow next_sibling pointers horizontally. Each subsequent leaf is one page read. For *N* results spread across *L* leaves, the cost is *h + L* page reads — and the *L* reads are sequential on disk if leaves were allocated in order.

The test at b-tree-storage-engine/test_btree.py:82 exercises this directly:


results = tree.range_scan("k05", "k10")
keys = [k for k, v in results]
assert keys == ["k05", "k06", "k07", "k08", "k09"]

This scan spans multiple leaf pages (with maxkeysperpage=4, 20 keys require at least 5 leaves). The rangescan method descends to "k05", then walks the sibling chain until it hits "k10" — no re-traversal of internal nodes.

The Leaf Format on Disk

The binary layout makes the chain explicit. From writeempty_leaf at line 52:


def _write_empty_leaf(self, page_num):
    data = struct.pack(HEADER_FMT, LEAF, 0)
    data += struct.pack('>I', NO_SIBLING)  # next sibling

Each leaf page is structured as:

| Field | Size | Purpose |

|-------|------|---------|

| Page type | 1B | LEAF=1 vs INTERNAL=0 |

| Key count | 2H | Number of entries |

| Next sibling | 4B | Page number of right neighbor |

| Key-value pairs | variable | Length-prefixed keys and values |

The nextsibling field sits right after the header — it's the first thing read after determining this is a leaf. deserializeleaf (line 189) returns it as the third element of the tuple: (keys, values, nextsibling).

Tradeoffs

What the chain buys you:

What it costs:

The DDIA insight: For workloads dominated by range queries (which most OLTP databases are), the leaf chain's sequential scan advantage far outweighs the extra descent cost on point lookups. This is why virtually every production database uses B+-trees, not B-trees.

What's Missing From These Observations

The observations only include the first 200 lines of btree.py (612 total). To fully trace the leaf chain mechanics, you'd want to read:

Topics to Explore

Beliefs