Topic: Instrument both engines to count total bytes written per logical put, especially under update-heavy workloads with compaction — this quantifies the LSM-tree's main cost

Date: 2026-05-29

Time: 12:34

Instrumenting Write Amplification: LSM-Tree vs B-Tree

Write amplification — the ratio of total physical bytes written to logical bytes intended — is the central cost of LSM-trees. Instrumenting both engines to measure it requires tracing every disk write triggered by a single put, including deferred writes from compaction and page splits.

The Write Paths

LSM-Tree (log-structured-merge-tree/lsm.py)

A single put triggers two immediate writes, plus deferred writes:

1. WAL append (lines 20–26): WAL.append() writes 4 + len(key_bytes) + 4 + len(value) bytes — two 4-byte length headers plus the key and value payloads. This is the synchronous durability cost.

2. Memtable flush (deferred): When the memtable fills, SSTable.write() (lines 80–100) serializes every buffered entry to disk. Each entry costs 4 + len(keybytes) + 4 + len(value) in the data block, plus the sparse index footer entries (key + 8-byte offset every sparseindex_interval entries, then an 8-byte footer-start pointer and 4-byte count).

3. Compaction (deferred): The LSMTree class (starts at line 200, truncated in observations) merges SSTables. Every compaction pass re-writes all entries from the input SSTables into new output SSTables. This is where write amplification compounds — the same key-value pair can be rewritten once per compaction level, giving worst-case amplification of O(levels × size_ratio).

B-Tree (b-tree-storage-engine/btree.py)

A single put writes at coarser granularity:

1. WAL log (lines 128–135): WAL.logwrite() writes 12 + pagesize + 4 bytes — a 12-byte header (seq + pagenum + datalen), the full page image, and a 4-byte CRC32 checksum. For the default 4096-byte page, that's 4112 bytes per WAL entry, regardless of how small the key-value pair is.

2. Page write (lines 78–85): PageManager.writepage() always writes exactly pagesize bytes (4096 default), padded with zeroes. Updating a single byte in a leaf means rewriting the entire page.

3. Metadata update (lines 44–49): writemeta() rewrites the full metadata page (another page_size bytes) whenever the tree structure changes (key count, root pointer, free list).

4. Page splits (deferred): When a leaf overflows, the split allocates a new page and rewrites the parent — at minimum 3 page writes (two leaves + parent), each page_size bytes.

Instrumentation Strategy

B-Tree: Already Partially Instrumented

The PageManager already tracks page I/O at lines 34–35:


self.pages_read = 0
self.pages_written = 0

with a reset_counters() method at lines 107–108. To get total bytes written per put:


bytes_written = pm.pages_written * pm.page_size

What's missing: WAL bytes are not counted. WAL.logwrite() writes 12 + len(pagedata) + 4 bytes per call but has no counter. Add a byteswritten accumulator to the WAL class and expose it alongside PageManager.pageswritten.

LSM-Tree: No Instrumentation Exists

The LSM engine has no write counters at all. You need to instrument three layers:

1. WAL.append() — accumulate 4 + len(k) + 4 + len(value) per call

2. SSTable.write() — after writing, measure the output file size (or accumulate within the for loop at line 87)

3. Compaction — the compaction function (beyond line 200, not shown) creates new SSTables. Instrument the same way as flush: count the bytes of each output SSTable

The key metric is:


write_amplification = total_physical_bytes_written / sum_of_logical_put_sizes

where logicalputsize = len(key) + len(value) for each user-facing put call.

Why This Matters Under Update-Heavy Workloads

When the workload is update-heavy (repeatedly overwriting the same keys):

The crossover: LSM wins when entries are small relative to page size (amortized flush is cheaper than a full-page rewrite), but loses under update-heavy workloads where compaction rewrites dominate. Instrumenting both engines lets you measure exactly where that crossover occurs for your data.

Observations Gaps

The LSMTree class definition (starting at line 200) was truncated — the compaction logic, memtable flush trigger, and put method are not visible. The sstable-and-compaction/sstable.py file shows the SSTable format but the compaction orchestration (size-tiered and leveled merge) was also truncated at line 200. Full instrumentation requires seeing both the compaction loop and the put → flush → compact call chain.

Topics to Explore

Beliefs