Original source: b-tree-storage-engine-test_btree

File: b-tree-storage-engine/test_btree.py

Date: 2026-05-28

Time: 18:28

`b-tree-storage-engine/test_btree.py`

Purpose

This is the test suite for a disk-backed B-Tree storage engine — one of the core data structure implementations from *Designing Data-Intensive Applications*. It validates the B-Tree as a complete key-value store with on-disk persistence, write-ahead logging, and crash recovery. The file owns correctness verification across the full surface area: CRUD operations, range queries, page-level I/O behavior, durability guarantees, and data integrity (CRC32 checksums).

Key Components

The suite is 11 standalone test functions, no class hierarchy — each creates a fresh BTree in a tempfile.TemporaryDirectory so tests are fully isolated.

Core functionality tests

testbasic — The integration smoke test. Exercises put, get, contains_, range scan, iteration order, min/max key, update-in-place, delete, stats, and persistence across close/reopen. This single test covers more surface area than most of the others combined.
testlarge — Inserts 1000 keys with zero-padded names (key00000–key_00999) and verifies all retrievals, sorted iteration, and that the tree has grown beyond height 1. Confirms the B-Tree scales past trivial sizes.
testrangescan — Tests bounded (k05 to k10, exclusive end) and unbounded (k15 to end) range scans. Validates that range boundaries are half-open: start-inclusive, end-exclusive.
testdeleteand_reinsert — Deletes every even-indexed key, reinserts them with new values, and verifies both the reinserted and untouched keys. Confirms the tree handles tombstone-to-live transitions correctly.

Durability and crash recovery tests

testwalrecovery — Simulates a crash by closing raw file handles (tree.pm.f, tree.wal.f) without calling tree.close(). Reopens and verifies all 10 keys survived. This tests that the WAL replays uncommitted page writes on startup.
testwaluncommittedentries — Goes further: after a clean close, manually appends a WAL entry (via WAL.logwrite) *without* a commit marker, then reopens. Verifies the tree still loads correctly — uncommitted WAL entries are replayed but shouldn't corrupt committed state.
testcrc32detects_corruption — Writes a WAL entry, then flips a byte inside it to corrupt the CRC. On reopen, the corrupted entry must be silently skipped while previously committed data remains intact.

Structural integrity tests

testpageio_counts — After inserting 100 keys, asserts that a single get reads at most height + 1 pages. Validates that lookups follow the B-Tree's O(log n) page-read guarantee, not scanning.
testtoolargekv — Confirms that inserting a value larger than the page can hold raises ValueError. The tree uses pagesize=128 and a 200-byte value to trigger this.
testdeletefreesemptyleaf — Deletes keys from a non-root leaf and verifies the tree stays consistent: sorted iteration works, range scans skip deleted keys, and the gap doesn't break traversal.
testmetadataconsistencyaftersplit — After 20 inserts (forcing multiple splits), reads the metadata page directly (tree.readmeta()) and asserts the root pointer references a valid INTERNAL node. Reopens and verifies metadata is identical — the root pointer, height, and key count survive persistence.

Patterns

Temp-directory isolation. Every test uses with tempfile.TemporaryDirectory() as tmpdir — the B-Tree writes real files to disk, so this gives each test a clean filesystem namespace with automatic cleanup.

Close-reopen persistence checks. Several tests (testbasic, testwalrecovery, testmetadataconsistencyafter_split) close the tree and reopen it from the same directory to verify durability. This is the signature pattern for testing storage engines.

Internal-API probing. Tests like testpageiocounts reach into tree.pm.pagesread, and testmetadataconsistencyaftersplit calls tree.readmeta() and pagetype(). These aren't black-box tests — they verify internal invariants that a pure API test can't catch.

Lazy imports. Several tests import B-Tree internals (WAL, serializeleaf, pagetype, INTERNAL, HEADER_FMT, LEAF) inside the function body rather than at module scope. This keeps the top-level import surface minimal and makes it clear which tests depend on internals.

maxkeysper_page=4. Nearly every test uses this small fanout to force splits with few keys, making tree height and structure predictable for assertions.

Dependencies

Imports:

tempfile, os, struct — standard library, for test scaffolding and low-level WAL manipulation
btree.BTree — the system under test
btree.WAL, btree.serializeleaf, btree.HEADERFMT, btree.LEAF, btree.page_type, btree.INTERNAL — internal APIs used only in durability/corruption tests

Imported by: Nothing — this is a leaf test module. The companion file testertestbtree.py likely wraps or re-runs these tests in a harness.

Flow

Each test follows the same arc:

1. Create a temp directory

2. Instantiate BTree(tmpdir, maxkeysper_page=4) — this either creates fresh files or opens existing ones

3. Perform writes (put), reads (get), deletes (delete), or scans (range_scan, iteration)

4. Assert correctness of results, stats, or internal state

5. Optionally close and reopen to test persistence

6. tree.close() (or simulate crash by skipping it)

7. Print "test_X PASSED" — the tests use print-based reporting, not a framework like pytest

When run as _main_, all 11 tests execute sequentially, and a final "All tests passed!" confirms the full suite.

Invariants

Sorted iteration: every test that iterates the tree asserts allkeys == sorted(allkeys) — the B-Tree must maintain key order.
Range scan boundaries: rangescan(start, end) is start-inclusive, end-exclusive — testrange_scan asserts "k05" is included but "k10" is not.
Page read bound: a single lookup must touch at most height + 1 pages (one per tree level plus possibly the data page).
Value size limit: keys/values that exceed page_size must raise ValueError, never silently corrupt.
WAL crash safety: data committed before a crash must survive; uncommitted WAL entries must not corrupt existing data.
CRC integrity: corrupted WAL entries are skipped, not applied.
Metadata consistency: readmeta() must return accurate root page, height, and key count, and these must survive close/reopen.

Error Handling

The tests themselves don't handle errors — they assert expected outcomes and let AssertionError propagate on failure. The one exception is testtoolargekv, which uses a try/except to verify that BTree.put raises ValueError for oversized values. WAL corruption (testcrc32detectscorruption) verifies that the *implementation* handles errors silently — a corrupted entry is skipped, not propagated to the caller.

Topics to Explore

[file] b-tree-storage-engine/btree.py — The implementation being tested: page manager, WAL, split logic, and the serialization format (serializeleaf, HEADER_FMT)
[function] b-tree-storage-engine/btree.py:BTree.range_scan — How half-open range iteration is implemented across leaf pages
[function] b-tree-storage-engine/btree.py:WAL.logwrite — WAL entry format, CRC32 checksumming, and the commit protocol that testwaluncommittedentries probes
[file] b-tree-storage-engine/testertestbtree.py — The test harness wrapper; understanding how it relates to this file clarifies the project's test infrastructure
[general] btree-split-and-merge-strategy — How the B-Tree handles node splits (tested by testmetadataconsistencyaftersplit) and whether it implements merge-on-delete or lazy tombstones

Beliefs

btree-range-scan-half-open — BTree.range_scan(start, end) uses half-open intervals: start-inclusive, end-exclusive; passing no end scans to the last key
btree-max-keys-forces-splits — With maxkeysper_page=4, inserting 5+ keys forces at least one split, producing a tree of height >= 2
btree-wal-replays-without-commit — The WAL replays all valid entries on recovery regardless of whether a commit marker is present
btree-crc32-skips-corrupt-entries — Corrupted WAL entries (bad CRC) are silently skipped during recovery rather than raising an error
btree-lookup-reads-bounded-by-height — A single BTree.get reads at most height + 1 disk pages