Date: 2026-05-29
Time: 06:28
Each implementation that uses a CRC makes its own choice about which bytes the checksum covers:
Log-structured bitcask (log-structured-hash-table/bitcask.py:155-157)
payload = key_bytes + value
crc = zlib.crc32(payload) & 0xFFFFFFFF
header = struct.pack(HEADER_FMT, crc, len(key_bytes), len(value))
CRC covers: key + value (pure payload). The header fields (crc, key size, value size) are excluded. The CRC sits *inside* the header but protects only the payload bytes that follow the header, not the two length fields stored next to it.
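The read side of this layout can be sketched as follows. This is a minimal illustration, not the code in bitcask.py: it assumes the `!III` header layout cited later in this note, and `encode_record` and `decode_record` are hypothetical names introduced here.

```python
import struct
import zlib
from typing import Tuple

HEADER_FMT = "!III"  # crc, key size, value size (layout as described in this note)
HEADER_SIZE = struct.calcsize(HEADER_FMT)


def encode_record(key_bytes: bytes, value: bytes) -> bytes:
    """Build a record whose CRC covers key + value only."""
    payload = key_bytes + value
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return struct.pack(HEADER_FMT, crc, len(key_bytes), len(value)) + payload


def decode_record(buf: bytes) -> Tuple[bytes, bytes]:
    """Hypothetical reader: parse the header, then verify the payload CRC."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("truncated header")
    crc, key_size, value_size = struct.unpack_from(HEADER_FMT, buf)
    payload = buf[HEADER_SIZE:HEADER_SIZE + key_size + value_size]
    if len(payload) != key_size + value_size:
        raise ValueError("truncated payload")
    # The header fields are not part of the CRC input, only the payload is.
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("CRC mismatch")
    return payload[:key_size], payload[key_size:]
```

A reader built this way trusts the length fields only far enough to slice the payload; any damage to the payload itself surfaces as a CRC mismatch.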
B-tree WAL (b-tree-storage-engine/btree.py:134-137)
header = struct.pack(self.ENTRY_HEADER, self._seq, page_num, len(page_data))
checksum = struct.pack('>I', self._checksum(page_data))
self._f.write(header + page_data + checksum)
CRC covers: page_data (pure payload). The header fields (seq, page_num, data_len) are excluded. Here the CRC is a *trailer* rather than a header field, but the boundary is the same: only the variable-length data blob.
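A trailer CRC means the reader cannot verify anything until it has located the end of the data blob. The sketch below shows one way that could look. It is built on assumptions: the `>QII` header layout is a guess (the real `ENTRY_HEADER` is not shown in this note), `checksum` stands in for `self._checksum` and is assumed to be a masked `zlib.crc32`, and `read_entry` is a hypothetical name.

```python
import struct
import zlib

ENTRY_HEADER = ">QII"  # ASSUMED layout: seq, page_num, data_len
HEADER_SIZE = struct.calcsize(ENTRY_HEADER)
TRAILER_SIZE = struct.calcsize(">I")  # 4-byte checksum trailer


def checksum(data: bytes) -> int:
    # ASSUMPTION: stand-in for self._checksum, which is not shown in the note.
    return zlib.crc32(data) & 0xFFFFFFFF


def encode_entry(seq: int, page_num: int, page_data: bytes) -> bytes:
    header = struct.pack(ENTRY_HEADER, seq, page_num, len(page_data))
    return header + page_data + struct.pack(">I", checksum(page_data))


def read_entry(buf: bytes, offset: int = 0):
    """Return (seq, page_num, page_data, next_offset), or None if the
    entry at `offset` is incomplete or fails its checksum."""
    end_header = offset + HEADER_SIZE
    if end_header > len(buf):
        return None  # header itself is cut short
    seq, page_num, data_len = struct.unpack_from(ENTRY_HEADER, buf, offset)
    end_data = end_header + data_len
    end_entry = end_data + TRAILER_SIZE
    if end_entry > len(buf):
        return None  # data or trailer missing, e.g. a torn write
    page_data = buf[end_header:end_data]
    (stored,) = struct.unpack_from(">I", buf, end_data)
    if stored != checksum(page_data):
        return None
    return seq, page_num, page_data, end_entry
```

In this sketch a write that stops before the trailer is rejected on length alone, before any checksum is computed.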
Standalone WAL (write-ahead-log/wal.py:30-32)
crc_data = struct.pack("B", op_type_byte) + key + value
crc = zlib.crc32(crc_data) & 0xFFFFFFFF
CRC covers: op_type_byte + key + value, that is, payload *plus one header field*. The sequence number, record length, key length, and value length are all excluded, but op_type_byte is included in both the CRC input and the serialized header.
Hash-index bitcask (hash-index-storage/bitcask.py)
No CRC at all. The header format <dII contains a timestamp, key size, and value size, with no integrity field. This implementation relies entirely on structural validation (successful parsing) to detect corruption.
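To make the limits of structural validation concrete, the sketch below builds a record with the `<dII` header and flips one bit in the value. The `encode` and `parse` helpers are hypothetical and only approximate what a length-checking parser would do; they are not taken from hash-index-storage/bitcask.py.

```python
import struct

HEADER_FMT = "<dII"  # timestamp, key size, value size (no integrity field)
HEADER_SIZE = struct.calcsize(HEADER_FMT)


def encode(timestamp: float, key: bytes, value: bytes) -> bytes:
    return struct.pack(HEADER_FMT, timestamp, len(key), len(value)) + key + value


def parse(buf: bytes):
    """Structural validation only: succeeds whenever the lengths fit the buffer."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("truncated header")
    timestamp, key_size, value_size = struct.unpack_from(HEADER_FMT, buf)
    end = HEADER_SIZE + key_size + value_size
    if end > len(buf):
        raise ValueError("truncated payload")
    key = buf[HEADER_SIZE:HEADER_SIZE + key_size]
    value = buf[HEADER_SIZE + key_size:end]
    return timestamp, key, value


record = bytearray(encode(1.0, b"user:1", b"alice"))
record[-1] ^= 0x01  # flip the lowest bit of the last value byte
_, key, value = parse(bytes(record))
# Parsing succeeds and nothing signals the damage: value is now b"alicd".
```

Damage that leaves the length fields intact passes straight through, which is the gap a payload CRC closes in the other three implementations.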
Three pieces of evidence point to coincidence rather than deliberate coordination:
1. The boundaries aren't actually identical. The standalone WAL breaks the payload-only pattern by pulling op_type_byte into the CRC input (wal.py:30). If this were a project convention, you'd expect a shared utility or at least a consistent rule. Instead, one implementation includes a header field and the others don't.
2. The fourth implementation has no CRC at all. The hash-index bitcask (hash-index-storage/bitcask.py) omits integrity checking entirely. A project-wide convention would presumably apply to all storage engines, not three of four.
3. The CRC placement differs. The log-structured bitcask puts CRC in the header (!III — crc first). The b-tree WAL puts it as a trailer (after page_data). The standalone WAL puts it second in the header (<IIQBi — length, then crc). These are structurally different record formats that happen to share a pragmatic instinct.
The shared instinct is what makes this look like a convention at first glance: all three exclude *framing metadata* (lengths, sequence numbers, offsets) from the CRC. This makes practical sense — if a length field is corrupted, the read will either grab the wrong number of bytes (producing a CRC mismatch anyway) or fail structurally. Checksumming lengths provides little additional protection over checksumming the payload alone. This is a well-known pattern in storage format design (e.g., LevelDB's block checksums cover block data + type byte but not the length framing), and each implementation likely arrived at it independently from the same reasoning.
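The claim about corrupted length fields can be checked with a small experiment. The sketch below reuses the `!III` header from the log-structured bitcask; `encode`, `check_first_record`, and `with_value_size` are hypothetical helpers written for this illustration.

```python
import struct
import zlib

HEADER_FMT = "!III"  # crc, key size, value size
HEADER_SIZE = struct.calcsize(HEADER_FMT)


def encode(key: bytes, value: bytes) -> bytes:
    payload = key + value
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return struct.pack(HEADER_FMT, crc, len(key), len(value)) + payload


def check_first_record(buf: bytes) -> str:
    """Classify the first record in buf as 'ok', 'crc_mismatch' or 'truncated'."""
    if len(buf) < HEADER_SIZE:
        return "truncated"
    crc, key_size, value_size = struct.unpack_from(HEADER_FMT, buf)
    end = HEADER_SIZE + key_size + value_size
    if end > len(buf):
        return "truncated"  # structural failure: length points past the data
    if zlib.crc32(buf[HEADER_SIZE:end]) & 0xFFFFFFFF != crc:
        return "crc_mismatch"
    return "ok"


def with_value_size(buf: bytes, new_size: int) -> bytes:
    """Simulate corruption of the value-size field; CRC and payload stay as written."""
    crc, key_size, _ = struct.unpack_from(HEADER_FMT, buf)
    return struct.pack(HEADER_FMT, crc, key_size, new_size) + buf[HEADER_SIZE:]


log = encode(b"k1", b"hello") + encode(b"k2", b"world")

results = {
    "intact": check_first_record(log),
    "one byte too long": check_first_record(with_value_size(log, 6)),
    "one byte too short": check_first_record(with_value_size(log, 4)),
    "wildly too long": check_first_record(with_value_size(log, 10**6)),
}
```

A slightly wrong length makes the reader checksum the wrong byte range, and a wildly wrong one runs off the end of the file. The CRC is never computed over the length, yet either way the record is rejected, barring a 32-bit checksum collision.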
The standalone WAL's inclusion of op_type_byte is the telling detail: it protects against the scenario where the framing is intact but the operation type is silently flipped (e.g., a PUT becomes a DELETE). This is a *semantic* concern that doesn't apply to the b-tree WAL (which logs raw pages with no operation semantics) or the bitcask (where key+value already encodes the full intent). Each author chose the CRC boundary that matched their corruption threat model, not a shared rule.
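The difference between the two boundaries shows up when only the operation type changes. In the sketch below, the `OP_PUT` and `OP_DELETE` values are hypothetical (the real constants in wal.py are not shown in this note), and the two CRC functions are simplified stand-ins for the payload-only and op-type-inclusive schemes.

```python
import struct
import zlib

OP_PUT, OP_DELETE = 1, 2  # hypothetical op-type values, for illustration only


def crc_payload_only(op_type: int, key: bytes, value: bytes) -> int:
    # Payload-only boundary: the op type never enters the CRC input.
    return zlib.crc32(key + value) & 0xFFFFFFFF


def crc_with_op_type(op_type: int, key: bytes, value: bytes) -> int:
    # Boundary described for the standalone WAL: op type byte + key + value.
    return zlib.crc32(struct.pack("B", op_type) + key + value) & 0xFFFFFFFF


def flip_detected(crc_fn, key: bytes, value: bytes) -> bool:
    """Write a PUT, corrupt the op type to DELETE, and report whether the
    CRC recomputed by the reader disagrees with the stored one."""
    stored = crc_fn(OP_PUT, key, value)
    recomputed = crc_fn(OP_DELETE, key, value)
    return stored != recomputed


detected_with_op = flip_detected(crc_with_op_type, b"user:1", b"alice")   # True
detected_without = flip_detected(crc_payload_only, b"user:1", b"alice")   # False
```

With the op type in the CRC input, a single flipped byte is always caught, since CRC-32 detects every error burst of 32 bits or fewer. Without it, the corrupted record verifies cleanly and would be replayed as a delete.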
write-ahead-log/wal.py:encoderecord: Why op_type_byte is included in the CRC but seq_num is not, and what corruption scenarios each choice defends against
b-tree-storage-engine/btree.py:recover: How the b-tree WAL handles partial writes and whether trailing CRC placement interacts differently with torn-write recovery than header CRC
hash-index-bitcask-no-crc: Whether the absence of CRC in hash-index-storage/bitcask.py is a deliberate simplification or an oversight, and what happens on silent corruption
crc32-vs-xxhash-choice: All three use zlib.crc32; whether this is adequate for the file sizes involved or whether a faster/stronger hash would change the design
log-structured-hash-table/bitcask.py:scansegment: How CRC failure at line 101 triggers a break (stop scanning) rather than skipping the record, and the implications for data after a corrupted record

payload-only-crc-is-coincidence: The three CRC implementations independently arrived at similar payload-centric boundaries rather than following a project convention; the standalone WAL's inclusion of op_type in its CRC input is the clearest evidence of independent design
hash-index-bitcask-has-no-integrity-check: The hash-index-storage bitcask (hash-index-storage/bitcask.py) has no CRC or checksum field, relying entirely on structural parse success to detect corruption
wal-crc-covers-op-type-unlike-others: The standalone WAL (write-ahead-log/wal.py:30) uniquely includes op_type_byte in its CRC input, protecting against silent operation-type corruption that the other implementations leave undetected
btree-wal-uses-trailer-crc: The b-tree WAL places its checksum as a trailer after page_data (btree.py:136), while the other implementations embed CRC in the record header, a structural difference that affects torn-write detection behavior