Topic: Compare the CRC boundary choices in bitcask, b-tree WAL, and this WAL to see if the payload-only pattern is a deliberate project convention or independent coincidence

Date: 2026-05-29

Time: 06:28

CRC Boundary Choices Across Three Implementations

The Three CRC Strategies

Three of the four storage implementations compute a CRC, and each makes its own choice about what data the checksum covers. The fourth, listed last for contrast, has no CRC:

1. Log-Structured Bitcask (log-structured-hash-table/bitcask.py:155-157)


    payload = key_bytes + value
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    header = struct.pack(HEADER_FMT, crc, len(key_bytes), len(value))

CRC covers: key + value (pure payload). The header fields (crc, key size, value size) are excluded. The CRC is stored *inside* the header but protects only the key and value bytes that follow the header, not the two size fields that sit next to it.
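
To make this boundary concrete, here is a minimal sketch of an encode/decode pair built around the three lines quoted above. It assumes HEADER_FMT is "!III" (crc, key size, value size); decode_record and its error handling are hypothetical and are not taken from bitcask.py.

```python
import struct
import zlib

HEADER_FMT = "!III"  # assumed layout: crc, key size, value size
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 12 bytes


def encode_record(key_bytes: bytes, value: bytes) -> bytes:
    """Serialize one record; the CRC covers key + value only."""
    payload = key_bytes + value
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    header = struct.pack(HEADER_FMT, crc, len(key_bytes), len(value))
    return header + payload


def decode_record(record: bytes):
    """Hypothetical reader: re-compute the CRC over the payload bytes."""
    crc, key_size, value_size = struct.unpack(HEADER_FMT, record[:HEADER_SIZE])
    payload = record[HEADER_SIZE:HEADER_SIZE + key_size + value_size]
    if len(payload) != key_size + value_size:
        raise ValueError("truncated record")
    if (zlib.crc32(payload) & 0xFFFFFFFF) != crc:
        raise ValueError("CRC mismatch")
    return payload[:key_size], payload[key_size:]
```

A round trip returns the original key and value, while a single flipped bit anywhere in the key or value raises the CRC mismatch error.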

2. B-Tree WAL (b-tree-storage-engine/btree.py:134-137)


    header = struct.pack(self.ENTRY_HEADER, self._seq, page_num, len(page_data))
    checksum = struct.pack('>I', self._checksum(page_data))
    self._f.write(header + page_data + checksum)

CRC covers: page_data (pure payload). The header fields (seq, page_num, data_len) are excluded. Here the CRC is a *trailer* rather than a header field, but the boundary is the same: only the variable-length data blob is checksummed.
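
The trailer layout can be sketched as a standalone pair of functions. This is an illustration under assumptions: the ENTRY_HEADER format ">QII" and the use of zlib.crc32 for _checksum are guesses, and decode_entry is a hypothetical reader that does not appear in btree.py.

```python
import struct
import zlib

ENTRY_HEADER = ">QII"  # assumed layout: seq, page_num, data_len
HEADER_SIZE = struct.calcsize(ENTRY_HEADER)  # 16 bytes


def encode_entry(seq: int, page_num: int, page_data: bytes) -> bytes:
    """Header, then page data, then a 4-byte CRC trailer over page_data only."""
    header = struct.pack(ENTRY_HEADER, seq, page_num, len(page_data))
    checksum = struct.pack(">I", zlib.crc32(page_data) & 0xFFFFFFFF)
    return header + page_data + checksum


def decode_entry(entry: bytes):
    """Hypothetical reader: verify the trailer against page_data."""
    seq, page_num, data_len = struct.unpack(ENTRY_HEADER, entry[:HEADER_SIZE])
    end = HEADER_SIZE + data_len
    page_data = entry[HEADER_SIZE:end]
    trailer = entry[end:end + 4]
    if len(page_data) != data_len or len(trailer) != 4:
        raise ValueError("truncated entry")
    (stored,) = struct.unpack(">I", trailer)
    if stored != (zlib.crc32(page_data) & 0xFFFFFFFF):
        raise ValueError("CRC mismatch")
    return seq, page_num, page_data
```

Because the header is outside the checksum, a flipped bit in page_num still decodes successfully in this sketch; only damage to page_data or the trailer itself is caught.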

3. Standalone WAL (write-ahead-log/wal.py:30-32)


    crc_data = struct.pack("B", op_type_byte) + key + value
    crc = zlib.crc32(crc_data) & 0xFFFFFFFF

CRC covers: op_type_byte + key + value, that is, payload *plus one header field*. The sequence number, record length, key length, and value length are all excluded, but op_type_byte appears in both the CRC input and the serialized header.
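
Wrapped as functions, the same two lines might look like the sketch below. The names record_crc and verify are hypothetical helpers introduced for illustration; only the CRC input expression is taken from the excerpt above.

```python
import struct
import zlib


def record_crc(op_type_byte: int, key: bytes, value: bytes) -> int:
    """CRC over the one-byte operation type followed by key and value."""
    crc_data = struct.pack("B", op_type_byte) + key + value
    return zlib.crc32(crc_data) & 0xFFFFFFFF


def verify(stored_crc: int, op_type_byte: int, key: bytes, value: bytes) -> bool:
    """Hypothetical read-side check: recompute and compare."""
    return record_crc(op_type_byte, key, value) == stored_crc
```

The sequence number and the length fields never enter record_crc, so changing them has no effect on the checksum.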

4. Hash-Index Bitcask (hash-index-storage/bitcask.py)

No CRC at all. The header format <dII contains a timestamp, key size, and value size, with no integrity field. This implementation relies entirely on structural validation (successful parsing) to detect corruption, so a bit flip inside a key or value that leaves the sizes intact goes unnoticed.
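
The limits of structural validation can be shown with a small sketch. It uses the <dII header described above; encode_record and decode_record are hypothetical stand-ins, not the functions in hash-index-storage/bitcask.py.

```python
import struct

HEADER_FMT = "<dII"  # timestamp, key size, value size; no CRC field
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 16 bytes


def encode_record(timestamp: float, key: bytes, value: bytes) -> bytes:
    return struct.pack(HEADER_FMT, timestamp, len(key), len(value)) + key + value


def decode_record(record: bytes):
    """Structural validation only: the record parses or it does not."""
    timestamp, key_size, value_size = struct.unpack(HEADER_FMT, record[:HEADER_SIZE])
    body = record[HEADER_SIZE:HEADER_SIZE + key_size + value_size]
    if len(body) != key_size + value_size:
        raise ValueError("truncated record")
    return timestamp, body[:key_size], body[key_size:]


record = encode_record(1.0, b"k", b"value")
corrupted = bytearray(record)
corrupted[-1] ^= 0x01  # flip one bit in the last value byte
print(decode_record(bytes(corrupted))[2])  # b'valud', parsed without complaint
```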

Verdict: Independent Coincidence, Not Convention

Three pieces of evidence point to coincidence rather than deliberate coordination:

1. The boundaries aren't actually identical. The standalone WAL breaks the payload-only pattern by pulling op_type_byte into the CRC input (wal.py:30). If this were a project convention, you'd expect a shared utility or at least a consistent rule. Instead, one implementation includes a header field and the others don't.

2. The fourth implementation has no CRC at all. The hash-index bitcask (hash-index-storage/bitcask.py) omits integrity checking entirely. A project-wide convention would presumably apply to all storage engines, not three of four.

3. The CRC placement differs. The log-structured bitcask puts the CRC first in the header (!III, with crc as the first field). The b-tree WAL appends it as a trailer after page_data. The standalone WAL puts it second in the header (<IIQBi: length, then crc). These are structurally different record formats that happen to share a pragmatic instinct.

The shared instinct is what makes this look like a convention at first glance: all three exclude *framing metadata* (lengths, sequence numbers, offsets) from the CRC. For length fields this makes practical sense: if a length is corrupted, the read will either grab the wrong number of bytes (which almost always produces a CRC mismatch) or fail structurally, so checksumming lengths adds little protection over checksumming the payload alone. The same reasoning does not extend to sequence or page numbers, whose corruption leaves the bytes being read unchanged and is therefore invisible to a payload-only CRC. Excluding length framing is a well-known pattern in storage format design (e.g., LevelDB's block checksums cover block data + type byte but not the length framing), and each implementation likely arrived at it independently from the same reasoning.
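
The length-corruption argument can be checked with a generic sketch. The frame layout here (4-byte length, 4-byte CRC, payload) is hypothetical and simpler than any of the three formats, and it decodes a single isolated frame; in a real log, an oversized length would read into the next record rather than run off the end.

```python
import struct
import zlib


def encode_frame(payload: bytes) -> bytes:
    """Hypothetical frame: length, CRC over payload only, then payload."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return struct.pack("<II", len(payload), crc) + payload


def decode_frame(frame: bytes) -> bytes:
    length, crc = struct.unpack("<II", frame[:8])
    payload = frame[8:8 + length]
    if len(payload) != length:
        raise ValueError("structural failure: frame shorter than its length field")
    if (zlib.crc32(payload) & 0xFFFFFFFF) != crc:
        raise ValueError("CRC mismatch")
    return payload


def with_length(frame: bytes, new_length: int) -> bytes:
    """Simulate corruption of the unprotected length field."""
    return struct.pack("<I", new_length) + frame[4:]


frame = encode_frame(b"hello world")
for bad_length in (5, 500):  # too short, then too long
    try:
        decode_frame(with_length(frame, bad_length))
    except ValueError as exc:
        print(bad_length, "->", exc)
```

A length that is too small makes the reader checksum a truncated payload, and one that is too large fails before the checksum is even computed, so both corruptions surface even though the length itself is not covered.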

The standalone WAL's inclusion of op_type_byte is the telling detail: it protects against the scenario where the framing is intact but the operation type is silently flipped (e.g., a PUT becomes a DELETE). This is a *semantic* concern that doesn't apply to the b-tree WAL (which logs raw pages with no operation semantics) or the bitcask (where key+value already encodes the full intent). Each CRC boundary appears to match its own implementation's corruption threat model rather than a shared rule.
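
The flipped-operation scenario is easy to reproduce in a sketch that compares the two boundaries side by side. The op codes OP_PUT and OP_DELETE and both helper functions are hypothetical; the real values in wal.py may differ.

```python
import zlib

OP_PUT, OP_DELETE = 1, 2  # hypothetical operation codes


def crc_payload_only(op: int, key: bytes, value: bytes) -> int:
    """Boundary used by the bitcask and b-tree WAL: op is ignored."""
    return zlib.crc32(key + value) & 0xFFFFFFFF


def crc_with_op(op: int, key: bytes, value: bytes) -> int:
    """Boundary used by the standalone WAL: op byte is part of the input."""
    return zlib.crc32(bytes([op]) + key + value) & 0xFFFFFFFF


key, value = b"user:1", b"alice"
stored_payload_crc = crc_payload_only(OP_PUT, key, value)
stored_op_crc = crc_with_op(OP_PUT, key, value)

# Corruption flips the operation byte on disk; key and value are untouched.
flipped = OP_DELETE
payload_only_ok = crc_payload_only(flipped, key, value) == stored_payload_crc
with_op_ok = crc_with_op(flipped, key, value) == stored_op_crc
print(payload_only_ok, with_op_ok)  # True False
```

Under the payload-only boundary the corrupted record still verifies and would be replayed as a DELETE, while the op-inclusive boundary rejects it.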

Topics to Explore

Beliefs