Original source: topic-index-size-vs-scan-cost-benchmarking

Topic: Run the `testlargedataset` test with different `block_size` values (4, 16, 64, 256) and measure actual lookup latency to see where the O(log(N/B) + B) tradeoff inflects

Date: 2026-05-29

Time: 13:18

Neither `testlargedataset` nor `block_size` Exist in This Codebase

The observations confirm this: both grep searches returned zero matches. This investigation as literally stated cannot be run — it needs to be adapted to the actual API.

What the codebase actually offers

The B-tree's branching factor is controlled by maxkeysperpage, not blocksize. You can see this in the test constructors throughout b-tree-storage-engine/testbtree.py — every test uses BTree(tmpdir, maxkeysperpage=4) (e.g., lines 9, 58, 80). The closest existing test is testlarge (line 57), which inserts 1,000 keys with maxkeysperpage=4 and verifies all lookups.

The PageManager class (b-tree-storage-engine/btree.py:28) also accepts a pagesize parameter (default 4096), which caps how many keys physically fit on a page regardless of maxkeysperpage.

Why the O(log(N/B) + B) model doesn't quite apply here

The import at btree.py:6 tells the real story:


from bisect import bisect_left, bisect_right

This implementation uses binary search within nodes, not linear scan. That makes the actual lookup cost O(log(N/B) · log(B)), not O(log(N/B) + B). The additive B term in the classical model assumes sequential scan through a node's keys — bisect_left eliminates that.

This means the "inflection point" the investigation is looking for will be much less dramatic than the textbook predicts. With bisect, larger B almost always wins (shorter tree, fewer page reads) because the in-node cost grows logarithmically rather than linearly. The tradeoff only reasserts itself when B gets large enough that serialization/deserialization cost dominates — which depends on pagesize and key/value sizes, not just maxkeysperpage.

What you'd actually need to do

To run this investigation, you'd write a new test that:

1. Parameterizes maxkeysper_page over [4, 16, 64, 256]

2. Inserts a large dataset (10K+ keys) for each value

3. Calls pm.resetcounters() (btree.py:112) before each lookup, then reads pm.pagesread after — this gives you I/O cost directly

4. Times lookups with time.perf_counter() for wall-clock latency

But you'd also need to increase pagesize proportionally for large maxkeysperpage values (256 keys won't fit in a 4096-byte page with any reasonable key/value sizes), which introduces a confound: larger pages mean more data per I/O operation.

The deeper issue

The testpageiocounts test (line 95) already validates the core invariant: lookups read at most height + 1 pages. Since height = ceil(logB(N)), increasing B reduces height and thus page reads. The interesting experiment isn't the lookup inflection — it's the insert throughput tradeoff, because larger nodes mean more data to serialize and rewrite on every split (serializeleaf at line 192, serializeinternal which would be further in the file).

Topics to Explore

[function] b-tree-storage-engine/btree.py:_search — The actual lookup traversal logic; confirm it uses bisect and count page reads per level
[function] b-tree-storage-engine/btree.py:serializeleaf — Serialization cost grows linearly with keys-per-page, which is where the real large-B penalty lives
[general] page-size-vs-max-keys-interaction — How pagesize caps effective maxkeysperpage and whether the code enforces or silently truncates
[function] b-tree-storage-engine/btree.py:PageManager.reset_counters — The instrumentation hook for measuring I/O; any benchmark should use this
[general] split-cost-at-high-branching-factor — Insert splits rewrite entire nodes; at B=256, each split serializes 128+ key-value pairs, making insert latency the real tradeoff

Beliefs

btree-uses-bisect-not-linear-scan — In-node key lookup uses bisectleft/bisectright, making within-node search O(log B) rather than the textbook O(B)
no-block-size-parameter-exists — The branching factor is controlled by maxkeysperpage, not blocksize; no testlargedataset test exists
page-io-counters-are-resettable — PageManager tracks pagesread and pageswritten with a reset_counters() method, enabling per-operation I/O measurement
page-size-constrains-max-keys — pagesize (default 4096) physically limits how many keys fit per node; maxkeysperpage=256 would require a proportionally larger page_size to avoid ValueError on oversized entries

Topic: Run the testlargedataset test with different block_size values (4, 16, 64, 256) and measure actual lookup latency to see where the O(log(N/B) + B) tradeoff inflects

Neither testlargedataset nor block_size Exist in This Codebase