Date: 2026-05-29
Time: 08:16
This codebase has six distinct storage components, and they fall into three tiers of integrity protection: payload-only CRC, data-only CRC (excluding WAL entry headers), and no CRC at all. None of them protect full records end-to-end.
File: log-structured-hash-table/bitcask.py
The header format at line 10 is "!III": three unsigned ints (crc32, key_size, value_size). The CRC is computed over key_bytes + value (the payload), but not over key_size or value_size:
crc = zlib.crc32(payload) & 0xFFFFFFFF, where payload = key_bytes + value
scansegment (line 99): expected_crc = zlib.crc32(payload) & 0xFFFFFFFF
get (line 182): same check
What's unprotected: If key_size or value_size in the header gets corrupted, the reader will slice the payload at the wrong boundaries. The CRC will likely fail (because the wrong bytes get hashed), but this is an accidental detection, not a guarantee. A key_size bit-flip that happens to produce a valid CRC over the mis-sliced payload would silently return wrong data.
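One way to close this gap is to include the length fields in the CRC input. The sketch below is not the code in bitcask.py: the layout (crc32, key_size, value_size, key, value) mirrors the "!III" header described above, while the function names encode_record and decode_record are hypothetical.

```python
import struct
import zlib

# Assumed layout: crc32 | key_size | value_size | key | value.
CRC = struct.Struct("!I")
LENGTHS = struct.Struct("!II")


def encode_record(key: bytes, value: bytes) -> bytes:
    """Build a record whose CRC covers the length fields and the payload."""
    body = LENGTHS.pack(len(key), len(value)) + key + value
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return CRC.pack(crc) + body


def decode_record(buf: bytes):
    """Return (key, value), raising ValueError on any detected corruption."""
    header_size = CRC.size + LENGTHS.size
    if len(buf) < header_size:
        raise ValueError("truncated header")
    (stored_crc,) = CRC.unpack_from(buf, 0)
    key_size, value_size = LENGTHS.unpack_from(buf, CRC.size)
    end = header_size + key_size + value_size
    if end > len(buf):
        raise ValueError("record extends past end of buffer")
    # The CRC input starts right after the CRC field, so it includes the lengths.
    if zlib.crc32(buf[CRC.size:end]) & 0xFFFFFFFF != stored_crc:
        raise ValueError("CRC mismatch")
    key = buf[header_size:header_size + key_size]
    value = buf[header_size + key_size:end]
    return key, value
```

With a payload-only CRC, shifting the key/value boundary (key_size up by one, value_size down by one) leaves the hashed bytes unchanged, so the check passes and the wrong key and value are returned. Here the same corruption changes the CRC input and is rejected.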
File: write-ahead-log/wal.py
The CRC input at line 30 is struct.pack("B", optypebyte) + key + value, so it covers the operation type and the key/value payload. The header packed at line 33 is "<IIQBi": recordlength, crc, seq_num, optypebyte, len(key).
What's protected: optypebyte, key, value
What's unprotected: recordlength, seq_num, len(key), and len(value) (packed separately at line 34). A corrupted seq_num would be accepted silently. A corrupted recordlength would cause the framing loop to read the wrong number of bytes for record_data, likely producing a short read or garbage; the failure mode is a crash, not a clean error.
Note that optypebyte is redundantly present in both the CRC input and the header. It's the only header field that gets integrity protection.
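A minimal sketch of a wider CRC input for this record type, assuming the sequence number, operation type, and both lengths are packed ahead of the payload before hashing. The helper names wal_crc and verify_wal_record and the "<QBII" packing are hypothetical and do not match the header in wal.py.

```python
import struct
import zlib


def wal_crc(seq_num: int, op_type: int, key: bytes, value: bytes) -> int:
    """CRC over sequence number, op type, both lengths, and the payload."""
    # Assumed metadata packing; the real header uses different field widths.
    meta = struct.pack("<QBII", seq_num, op_type, len(key), len(value))
    return zlib.crc32(meta + key + value) & 0xFFFFFFFF


def verify_wal_record(stored_crc: int, seq_num: int, op_type: int,
                      key: bytes, value: bytes) -> bool:
    """Recompute the CRC from the decoded fields and compare."""
    return wal_crc(seq_num, op_type, key, value) == stored_crc
```

The record length field cannot easily be part of its own framing check, but once the CRC passes it can be cross-checked against the sizes of the verified fields.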
File: b-tree-storage-engine/btree.py
The WAL entry format (line 120) is: seq(4B) + page_num(4B) + datalen(4B) + data + checksum(4B). The checksum at line 133 covers only page_data:
checksum = struct.pack('>I', self._checksum(page_data))
What's protected: The page data bytes written to the WAL.
What's unprotected: seq, page_num, and datalen. During recovery (lines 162–164), a corrupted page_num would cause the recovered page to be written to the wrong location in the B-tree file, which is a silent, catastrophic corruption. The checksum would pass because the page data itself is fine; only the routing metadata is wrong.
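The sketch below shows one possible entry format in which the checksum also covers the routing metadata. It keeps the seq + page_num + data length + data + checksum layout described above, but uses zlib.crc32 as a stand-in because the behaviour of self._checksum is not shown here. The names pack_entry and unpack_entry are hypothetical.

```python
import struct
import zlib

ENTRY_HEADER = struct.Struct(">III")  # seq, page_num, data length
CHECKSUM = struct.Struct(">I")


def pack_entry(seq: int, page_num: int, page_data: bytes) -> bytes:
    """Serialize a WAL entry whose checksum covers header and page data."""
    header = ENTRY_HEADER.pack(seq, page_num, len(page_data))
    checksum = zlib.crc32(header + page_data) & 0xFFFFFFFF
    return header + page_data + CHECKSUM.pack(checksum)


def unpack_entry(entry: bytes):
    """Return (seq, page_num, page_data) or raise ValueError if corrupt."""
    if len(entry) < ENTRY_HEADER.size + CHECKSUM.size:
        raise ValueError("truncated entry")
    seq, page_num, data_len = ENTRY_HEADER.unpack_from(entry, 0)
    data_end = ENTRY_HEADER.size + data_len
    if data_end + CHECKSUM.size > len(entry):
        raise ValueError("data length exceeds entry size")
    (stored,) = CHECKSUM.unpack_from(entry, data_end)
    # Hash everything before the checksum, including the page number.
    if zlib.crc32(entry[:data_end]) & 0xFFFFFFFF != stored:
        raise ValueError("checksum mismatch")
    return seq, page_num, entry[ENTRY_HEADER.size:data_end]
```

Under this layout, recovery would refuse to replay an entry whose page number was altered, instead of writing an intact page to the wrong offset.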
File: log-structured-merge-tree/lsm.py
Both the WAL class (lines 20–25) and SSTable class (lines 85–99) use bare length-prefixed records: struct.pack(">I", len(k)) followed by the key bytes, then struct.pack(">I", len(value)) followed by value bytes. No checksums anywhere. A single bit-flip in a length field causes cascading misframing of all subsequent records.
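The cascading misframe can be reproduced with a few lines. The sketch assumes the record layout described above (4-byte big-endian length, key, 4-byte length, value) and a hypothetical reader that stops on a short read. It is not the reader in lsm.py.

```python
import struct


def write_records(records):
    """Serialize (key, value) pairs as bare length-prefixed records."""
    out = bytearray()
    for key, value in records:
        out += struct.pack(">I", len(key)) + key
        out += struct.pack(">I", len(value)) + value
    return bytes(out)


def read_records(buf):
    """Parse records until the buffer ends or a read comes up short."""
    records, pos = [], 0
    while pos + 4 <= len(buf):
        (klen,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        key = buf[pos:pos + klen]
        pos += klen
        if pos + 4 > len(buf):
            break
        (vlen,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        value = buf[pos:pos + vlen]
        pos += vlen
        if pos > len(buf):
            break  # short read: the length pointed past the end of the data
        records.append((key, value))
    return records


original = [(b"a", b"1"), (b"b", b"2"), (b"c", b"3")]
clean = write_records(original)

# Flip one bit in the first key length (1 becomes 3).
corrupted = bytearray(clean)
corrupted[3] ^= 0x02
recovered = read_records(bytes(corrupted))
```

In this run the reader recovers nothing: the shifted boundary makes the next length field decode as a value far larger than the buffer, so the two intact records after the flipped bit are lost as well. Without a per-record checksum or sync marker there is no point at which the reader can resynchronize.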
File: sstable-and-compaction/sstable.py
Has a magic number (SSTB, line 50) for file type validation but no per-entry or per-file checksums. Entry format is [keylen:2][key][timestamp:8][tombstone|valuelen:4+value]. Corruption in any length field silently misframes the scan.
File: hash-index-storage/bitcask.py
Header format at line 10 is "<dII": timestamp, key_size, value_size. No CRC field at all. Records are trusted blindly on read (lines 97–100).
File: bloom-filter/bloom_filter.py
Line 85 packs m, k, count as a header. The bit array is serialized after. No checksum protects either the header or the filter data.
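A trailing checksum over the header and the bit array would catch both kinds of corruption. This is a sketch only: the ">QII" header widths for m, k, and count are assumptions (the actual packing at line 85 is not shown here), and serialize_filter and deserialize_filter are hypothetical names.

```python
import struct
import zlib

HEADER = struct.Struct(">QII")  # m, k, count (assumed field widths)
TRAILER = struct.Struct(">I")   # CRC32 over header + bit array


def serialize_filter(m: int, k: int, count: int, bits: bytes) -> bytes:
    """Serialize the filter and append a CRC32 of everything before it."""
    body = HEADER.pack(m, k, count) + bits
    return body + TRAILER.pack(zlib.crc32(body) & 0xFFFFFFFF)


def deserialize_filter(blob: bytes):
    """Return (m, k, count, bits) or raise ValueError if the blob is corrupt."""
    if len(blob) < HEADER.size + TRAILER.size:
        raise ValueError("truncated filter")
    body = blob[:-TRAILER.size]
    (stored,) = TRAILER.unpack_from(blob, len(body))
    if zlib.crc32(body) & 0xFFFFFFFF != stored:
        raise ValueError("checksum mismatch")
    m, k, count = HEADER.unpack_from(body, 0)
    bits = body[HEADER.size:]
    # Sanity check: the bit array must be exactly ceil(m / 8) bytes.
    if len(bits) != (m + 7) // 8:
        raise ValueError("bit array length does not match m")
    return m, k, count, bits
```

A cleared bit in a Bloom filter can turn a correct "maybe present" into a false "definitely absent", so rejecting a damaged filter and rebuilding it is usually safer than loading it.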
| Component | CRC? | Covers | Header protected? | Silent corruption risk |
|-----------|------|--------|--------------------|----------------------|
| log-structured-hash-table/bitcask.py | Yes | key + value | No (key_size, value_size excluded) | Mis-sliced payload |
| write-ahead-log/wal.py | Yes | optype + key + value | Partial (optype only) | Wrong seq_num, bad framing |
| b-tree-storage-engine/btree.py WAL | Yes | page data | No (seq, page_num, datalen excluded) | Page written to wrong location |
| log-structured-merge-tree/lsm.py | No | — | — | Cascading misframe |
| sstable-and-compaction/sstable.py | No | — | — | Cascading misframe |
| hash-index-storage/bitcask.py | No | — | — | Silent wrong reads |
| bloom-filter/bloom_filter.py | No | — | — | False negatives/positives |
The key finding: no component in this codebase protects full-record integrity. The three that have CRC all exclude framing/routing metadata from the checksum, which means header corruption can cause silent data loss or misrouting even when the payload checksum passes.
- b-tree-storage-engine/btree.py:WAL.recover - The most dangerous gap: a corrupted page_num passes the checksum and writes a valid page to the wrong disk location
- log-structured-hash-table/bitcask.py:scansegment - How CRC failure during recovery is handled (stops scanning, potentially losing valid trailing records)
- length-prefix-framing-resilience - Whether any component can resync after a corrupted length field, or if all suffer cascading misframes
- hash-index-storage/bitcask.py - Compare this CRC-less Bitcask variant against the CRC-protected one in log-structured-hash-table/ to understand when integrity was deemed necessary
- sstable-magic-number-vs-crc - The SSTable uses a magic number for file-type validation but no per-entry checksums; explore whether the magic check gives any meaningful corruption protection beyond "wrong file type"
- bitcask-crc-excludes-header - In log-structured-hash-table/bitcask.py, the CRC32 covers only the key+value payload; the key_size and value_size header fields are outside the checksum boundary
- wal-crc-excludes-seq-and-lengths - In write-ahead-log/wal.py, the CRC covers optype + key + value but not recordlength, seq_num, keylen, or value_len, so a corrupted sequence number is accepted silently
- btree-wal-checksum-excludes-page-num - In b-tree-storage-engine/btree.py, the WAL checksum covers only page_data, meaning a corrupted page_num during recovery writes a valid page to the wrong location without detection
- lsm-and-sstable-have-no-checksums - Neither log-structured-merge-tree/lsm.py nor sstable-and-compaction/sstable.py computes or verifies any checksums; a single bit-flip in a length-prefix field causes cascading misframing of all subsequent records
- no-component-has-full-record-crc - Across all six storage components, none computes a CRC over the complete record (header + payload); integrity protection, where it exists, always excludes framing or routing metadata