Topic: Audit all struct.pack/CRC sites across the codebase (87 matches found) to identify which components protect only payload vs. full-record integrity

Date: 2026-05-29

Time: 08:16

Integrity Protection Audit: Payload-Only vs. Full-Record CRC

This audit covers seven storage components, which fall into four levels of integrity protection: payload-only CRC, partial-record CRC (operation type plus payload), data-only CRC (WAL entry header excluded), and no CRC at all. None of them protects full records end to end.

Components With CRC Protection

1. Log-Structured Hash Table (Bitcask) — Payload-only CRC

File: log-structured-hash-table/bitcask.py

The header format at line 10 is "!III": three unsigned 32-bit integers holding crc32, key_size, and value_size. The CRC is computed over keybytes + value (the payload), but not over key_size or value_size.

What's unprotected: If key_size or value_size in the header is corrupted, the reader slices the payload at the wrong boundaries. The CRC will then usually fail because the wrong bytes are hashed, but that detection is incidental rather than guaranteed: a corrupted size field whose mis-sliced payload happens to hash to the stored CRC would silently return wrong data.
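The gap can be illustrated with a minimal sketch. It assumes the "!III" layout described above and uses `zlib.crc32` as the checksum; the checksum function actually used in bitcask.py is not shown in this audit, and `encode`, `decode`, `shift_boundary`, and the `cover_sizes` flag are hypothetical names introduced only for illustration. The corruption simulated here changes both size fields so that their sum stays the same, which leaves the hashed bytes untouched.

```python
import struct
import zlib

HEADER = struct.Struct("!III")  # crc32, key_size, value_size (layout from the audit)
SIZES = struct.Struct("!II")    # key_size, value_size on their own


def encode(key: bytes, value: bytes, cover_sizes: bool) -> bytes:
    """Pack one record; cover_sizes=False mimics the payload-only CRC."""
    covered = key + value
    if cover_sizes:
        # Hypothetical fix: hash the size fields together with the payload.
        covered = SIZES.pack(len(key), len(value)) + covered
    return HEADER.pack(zlib.crc32(covered), len(key), len(value)) + key + value


def decode(record: bytes, cover_sizes: bool) -> tuple[bytes, bytes]:
    """Return (key, value) or raise ValueError if the record fails validation."""
    crc, key_size, value_size = HEADER.unpack_from(record)
    body = record[HEADER.size:HEADER.size + key_size + value_size]
    if len(body) != key_size + value_size:
        raise ValueError("truncated record")
    covered = body
    if cover_sizes:
        covered = SIZES.pack(key_size, value_size) + body
    if zlib.crc32(covered) != crc:
        raise ValueError("checksum mismatch")
    return body[:key_size], body[key_size:]


def shift_boundary(record: bytes) -> bytes:
    """Simulate header corruption: key_size grows by 1, value_size shrinks by 1."""
    crc, key_size, value_size = HEADER.unpack_from(record)
    return HEADER.pack(crc, key_size + 1, value_size - 1) + record[HEADER.size:]
```

With the payload-only CRC, decoding `shift_boundary(encode(b"user", b"42abc", False))` returns `(b"user4", b"2abc")` without any error, because the concatenated payload is unchanged. With `cover_sizes=True`, the same corruption raises `ValueError`.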

2. Write-Ahead Log — Partial-record CRC

File: write-ahead-log/wal.py

The CRC input at line 30 is struct.pack("B", optypebyte) + key + value — it covers the operation type and the key/value payload. The header packed at line 33 is "<IIQBi": recordlength, crc, seqnum, optypebyte, len(key).

What's protected: optypebyte, key, value

What's unprotected: recordlength, seqnum, len(key), and len(value) (packed separately at line 34). A corrupted seqnum would be accepted silently. A corrupted recordlength would make the framing loop read the wrong number of bytes for record_data, likely producing a short read or garbage; the failure mode is then a crash rather than a clean error.

Note that optypebyte is redundantly present in both the CRC input and the header. It's the only header field that gets integrity protection.
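One way to close both gaps is sketched below. This is not the layout used in wal.py: the split into a `PREFIX` (record length and CRC) and a CRC-covered `FIELDS` block, the `MAX_RECORD` bound, the `CorruptRecord` exception, and the function names are all assumptions made for illustration, and `zlib.crc32` stands in for whatever checksum the WAL actually uses. The idea is that every field after the CRC is hashed, and the record length is cross-checked against the protected key and value lengths so that bad framing surfaces as a clean error.

```python
import struct
import zlib

PREFIX = struct.Struct("<II")    # record_length, crc (hypothetical split of the header)
FIELDS = struct.Struct("<QBii")  # seq_num, op_type, key_len, value_len (all CRC-covered)
MAX_RECORD = 1 << 20             # assumed upper bound on a sane record body


class CorruptRecord(Exception):
    """Raised for any framing or checksum problem instead of crashing."""


def encode(seq_num: int, op_type: int, key: bytes, value: bytes) -> bytes:
    body = FIELDS.pack(seq_num, op_type, len(key), len(value)) + key + value
    return PREFIX.pack(len(body), zlib.crc32(body)) + body


def decode(buf: bytes, offset: int = 0):
    """Return (seq_num, op_type, key, value, next_offset) or raise CorruptRecord."""
    if len(buf) - offset < PREFIX.size:
        raise CorruptRecord("truncated prefix")
    record_length, crc = PREFIX.unpack_from(buf, offset)
    if not FIELDS.size <= record_length <= MAX_RECORD:
        raise CorruptRecord("implausible record length")
    start = offset + PREFIX.size
    body = buf[start:start + record_length]
    if len(body) != record_length:
        raise CorruptRecord("short read")
    if zlib.crc32(body) != crc:
        raise CorruptRecord("checksum mismatch")
    seq_num, op_type, key_len, value_len = FIELDS.unpack_from(body)
    # The lengths are now trusted (CRC passed); check they agree with the frame.
    if key_len < 0 or value_len < 0 or FIELDS.size + key_len + value_len != record_length:
        raise CorruptRecord("length fields disagree with record length")
    key = body[FIELDS.size:FIELDS.size + key_len]
    value = body[FIELDS.size + key_len:]
    return seq_num, op_type, key, value, start + record_length
```

The record length itself is not hashed in this sketch, but a corrupted length changes which bytes are hashed and must also match the CRC-protected key and value lengths, so it cannot pass unnoticed on its own.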

3. B-Tree WAL — Data-only CRC (header excluded)

File: b-tree-storage-engine/btree.py

The WAL entry format (line 120) is: seq(4B) + pagenum(4B) + datalen(4B) + data + checksum(4B). The checksum at line 133 covers only page_data:


checksum = struct.pack('>I', self._checksum(page_data))

What's protected: The page data bytes written to the WAL.

What's unprotected: seq, pagenum, and datalen. During recovery (lines 162–164), a corrupted page_num would cause the recovered page to be written to the wrong location in the B-tree file, which is a silent and catastrophic corruption. The checksum would still pass because the page data itself is intact; only the routing metadata is wrong.
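A minimal sketch of a header-inclusive checksum for this entry format follows. It keeps the field order described above (three 4-byte big-endian header fields, the data, then a 4-byte checksum) but assumes `zlib.crc32` in place of the engine's own `_checksum` method, whose definition is not shown in this audit; `encode_entry` and `decode_entry` are hypothetical names.

```python
import struct
import zlib

ENTRY_HEADER = struct.Struct(">III")  # seq, page_num, data_len
CHECKSUM = struct.Struct(">I")


def encode_entry(seq: int, page_num: int, page_data: bytes) -> bytes:
    header = ENTRY_HEADER.pack(seq, page_num, len(page_data))
    # Hash the routing metadata together with the page bytes.
    checksum = zlib.crc32(header + page_data)
    return header + page_data + CHECKSUM.pack(checksum)


def decode_entry(entry: bytes) -> tuple[int, int, bytes]:
    """Return (seq, page_num, page_data) or raise ValueError."""
    if len(entry) < ENTRY_HEADER.size + CHECKSUM.size:
        raise ValueError("entry too short")
    seq, page_num, data_len = ENTRY_HEADER.unpack_from(entry)
    end = ENTRY_HEADER.size + data_len
    if end + CHECKSUM.size > len(entry):
        raise ValueError("data_len exceeds entry size")
    (stored,) = CHECKSUM.unpack_from(entry, end)
    if zlib.crc32(entry[:end]) != stored:
        raise ValueError("checksum mismatch")
    return seq, page_num, entry[ENTRY_HEADER.size:end]
```

Under this scheme a flipped bit in the page number fails verification during recovery, so the entry can be skipped or reported rather than replayed onto the wrong page.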

Components With No CRC

4. LSM Tree — No integrity checks

File: log-structured-merge-tree/lsm.py

Both the WAL class (lines 20–25) and SSTable class (lines 85–99) use bare length-prefixed records: struct.pack(">I", len(k)) followed by the key bytes, then struct.pack(">I", len(value)) followed by value bytes. No checksums anywhere. A single bit-flip in a length field causes cascading misframing of all subsequent records.
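The misframing can be reproduced with a small stand-alone sketch. It assumes only the bare length-prefixed layout described above; `write_records`, `read_records`, and `_take` are hypothetical helpers, not functions from lsm.py.

```python
import struct

LEN = struct.Struct(">I")  # 4-byte big-endian length prefix, as described in the audit


def write_records(pairs):
    """Serialize (key, value) pairs as len(key) + key + len(value) + value."""
    out = bytearray()
    for key, value in pairs:
        out += LEN.pack(len(key)) + key + LEN.pack(len(value)) + value
    return bytes(out)


def _take(buf, offset, n):
    """Read exactly n bytes at offset, or raise ValueError on a short read."""
    chunk = buf[offset:offset + n]
    if len(chunk) != n:
        raise ValueError("truncated record")
    return chunk, offset + n


def read_records(buf):
    """Parse records back; there is no checksum, so lengths are trusted as-is."""
    records, offset = [], 0
    while offset < len(buf):
        raw, offset = _take(buf, offset, LEN.size)
        key, offset = _take(buf, offset, LEN.unpack(raw)[0])
        raw, offset = _take(buf, offset, LEN.size)
        value, offset = _take(buf, offset, LEN.unpack(raw)[0])
        records.append((key, value))
    return records
```

For the input `[(b"a", b"\x00" * 8), (b"b", b"2")]`, flipping one bit so the first value length reads 0 instead of 8 makes `read_records` return `[(b"a", b""), (b"", b""), (b"b", b"2")]` with no error: the first value is lost and a phantom empty record appears. Flipping a bit in the first key length instead turns the next length field into a huge number, and the reader fails with a short read on the very first record.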

5. SSTable (standalone) — No integrity checks

File: sstable-and-compaction/sstable.py

The file has a magic number (SSTB, line 50) for file type validation but no per-entry or per-file checksums. Entry format is [keylen:2][key][timestamp:8][tombstone|valuelen:4+value]. Corruption in any length field silently misframes the scan.

6. Hash-Index Bitcask Store — No integrity checks

File: hash-index-storage/bitcask.py

Header format at line 10 is "<dII": timestamp, keysize, valuesize. There is no CRC field at all. Records are trusted blindly on read (lines 97–100).

7. Bloom Filter — No integrity checks

File: bloom-filter/bloom_filter.py

Line 85 packs m, k, count as a header. The bit array is serialized after. No checksum protects either the header or the filter data.

Summary Matrix

| Component | CRC? | Covers | Header protected? | Silent corruption risk |
|-----------|------|--------|-------------------|------------------------|
| log-structured-hash-table/bitcask.py | Yes | key + value | No (key_size, value_size excluded) | Mis-sliced payload |
| write-ahead-log/wal.py | Yes | optype + key + value | Partial (optype only) | Wrong seq_num, bad framing |
| b-tree-storage-engine/btree.py WAL | Yes | page data | No (seq, pagenum, datalen excluded) | Page written to wrong location |
| log-structured-merge-tree/lsm.py | No | n/a | n/a | Cascading misframe |
| sstable-and-compaction/sstable.py | No | n/a | n/a | Cascading misframe |
| hash-index-storage/bitcask.py | No | n/a | n/a | Silent wrong reads |
| bloom-filter/bloom_filter.py | No | n/a | n/a | False negatives/positives |

The key finding: no component in this codebase protects full-record integrity. The three that have CRC all exclude framing/routing metadata from the checksum, which means header corruption can cause silent data loss or misrouting even when the payload checksum passes.
