Date: 2026-05-29
Time: 10:25
This B-tree implementation (b-tree-storage-engine/btree.py) is entirely single-threaded. PageManager.readpage (line 70) and writepage (line 77) operate directly on a single file handle with no synchronization. allocatepage (line 88) reads metadata, computes the next free page, and writes it back — a classic read-modify-write sequence with no atomicity guarantee. If two threads called allocatepage concurrently, both could allocate the same page number.
The standalone write-ahead-log/wal.py does use a threading.Lock (line 80: self.lock = threading.Lock()), and every mutating method (append at line 139, appendbatch at line 150, checkpoint at line 163) acquires it. But this protects only the WAL's internal state — it doesn't coordinate B-tree structural modifications like splits.
This is fine for a teaching implementation. Production systems need a very different approach.
Production databases distinguish two concepts that this codebase collapses:
SELECT FOR UPDATE are locks.threading.Lock() in wal.py:80 is closer to a latch in spirit, though it protects the WAL rather than tree pages.The B-tree in btree.py needs latches, not locks. The problem isn't transaction isolation — it's that two threads traversing the tree simultaneously could see a half-written page or a node mid-split.
The naive approach — latch the entire tree for every operation — kills concurrency. Production systems use lock coupling (also called "crabbing"):
1. Latch the parent before descending.
2. Latch the child page.
3. Release the parent if the child is "safe" (won't split on insert, won't merge on delete).
4. Repeat down the tree.
Consider a lookup in this implementation's tree. BTree.get would traverse from root to leaf, calling read_page at each level (the test at line 95 asserts reads ≤ height + 1). With lock coupling, the traversal would hold at most two latches at any time — the current node and its parent — releasing the parent once the child is confirmed safe.
For writes, "safe" means the child node has room for an insertion without splitting (i.e., numkeys < maxkeysperpage). If the child might split, you must hold the parent's latch because the split will insert a new separator key into the parent.
Look at what happens during a split in btree.py. The code must:
1. Read the full node (read_page)
2. Allocate a new sibling page (allocate_page, line 88)
3. Write the split contents to both pages (write_page)
4. Insert the separator key into the parent
5. Possibly split the parent too, cascading up to the root
Without latching, a concurrent reader could see the original node after step 3 but before step 4 — the sibling exists but the parent doesn't point to it yet. Keys in the new sibling become invisible. Worse, a concurrent writer might follow a stale pointer and corrupt data.
InnoDB uses a combination of techniques:
PostgreSQL's nbtree uses the Lehman-Yao algorithm, which is specifically designed to reduce latching:
nextsibling field in serializeleaf (btree.py, line 169: buf += struct.pack('>I', nextsibling)). But Lehman-Yao extends this to internal nodes too, not just leaves.The next_sibling field already present in this implementation's leaf serialization is the seed of this idea, but Lehman-Yao requires a high key (the upper bound of keys in each node) and right-links on internal nodes, neither of which this implementation has.
The WAL class in btree.py (line 122) logs page writes before applying them. In a concurrent system, the WAL and latching must coordinate carefully:
wal.py serializes all writes through self.lock (line 139), which effectively makes the WAL a single-writer bottleneck. Production WALs use group commit — batching multiple transactions' WAL records into a single fsync — to amortize this cost. The appendbatch method (line 150) hints at this pattern but applies it within a single transaction rather than across transactions.lehman-yao-right-links — The Lehman-Yao algorithm eliminates parent-to-child latch coupling by adding right-link pointers and high keys to every node; compare with the existing leaf next_sibling field at btree.py:169b-tree-storage-engine/btree.py:allocate_page — This read-modify-write on metadata (lines 88-96) is the most obvious concurrency hazard; adding a latch here reveals the broader page-allocation coordination problemoptimistic-vs-pessimistic-coupling — InnoDB's optimistic approach (S-latch down, restart with X-latch on split) vs. pessimistic (X-latch from root on every write) is a key performance tradeoff worth studyingwrite-ahead-log/wal.py:append_batch — Group commit across transactions (not just within one) is how production WALs avoid making the WAL lock a throughput bottleneckbuffer-pool-integration — Production B-trees don't call readpage/writepage directly on files; they go through a buffer pool with pinning and replacement policies, and latches live on the buffer frame, not the pagebtree-no-concurrency-control — BTree in btree.py has no latches, mutexes, or synchronization of any kind; concurrent access to any BTree method is unsafewal-has-threading-lock — WriteAheadLog in wal.py:80 uses a threading.Lock to serialize append, append_batch, checkpoint, and truncate, making it the only concurrency-aware component in the codebaseleaf-next-sibling-exists — Leaf nodes store a next_sibling pointer (btree.py:169), which is a prerequisite for Lehman-Yao but insufficient without internal-node right-links and high keysallocate-page-race-condition — PageManager.allocate_page (lines 88-96) performs an unprotected read-modify-write on the metadata page; two concurrent callers can allocate the same page number