Date: 2026-05-28
Time: 19:08
No problem — the observations have enough context to give a thorough explanation.
log-structured-merge-tree/lsm.py:range_scan — The dict-based merge at lines 274–298; refactoring to a heapq-based streaming iterator would be a great exercisesstable-and-compaction/sstable.py:SSTableReader.get — Uses binary search on the sparse index (line 200+), which is the per-SSTable lookup that a merge iterator would call next() onleveldb-merging-iterator — LevelDB's table/merger.cc implements the canonical heap-based k-way merge; comparing it to this Python impl reveals what production correctness requires (direction reversal, snapshot reads, seek optimization)log-structured-merge-tree/lsm.py:compact — The compaction merge at lines 323–347 materializes all entries; converting to streaming would demonstrate O(k) memory compactionbloom-filter-optimization — Production LSMs avoid touching SSTables entirely for missing keys using bloom filters; this complements the merge iterator by reducing k before the heap is even builtlsm-range-scan-materializes-all — lsm.py:range_scan loads all matching entries from every SSTable and memtable into a dict before returning any results, making first-result latency proportional to total result set sizeheapq-imported-but-unused-for-merge — lsm.py imports heapq (line 6) but uses dict-based merging for range scans and list-based merging for compaction; no heap-based k-way merge is implementedsstable-scan-is-streaming — SSTable.scan() and scan_all() use Python generators (yield), meaning individual SSTables already support streaming — only the cross-SSTable merge layer materializescompaction-buffers-all-entries — lsm.py:compact collects all merged entries into a list (line 334) before writing the output SSTable, requiring O(n) memory proportional to total data rather than O(k) proportional to number of input SSTablesnewer-overwrites-via-priority-counter — The range scan merge uses an incrementing priority counter (lines 277–293) where higher priority = newer source; the dict naturally deduplicates by overwriting older entries for the same key