Topic: Whether the absence of CRC in hash-index-storage/bitcask.py is a deliberate simplification or an oversight, and what happens on silent corruption

Date: 2026-05-29

Time: 08:13

CRC Absence in hash-index-storage/bitcask.py: Deliberate Simplification

The absence of CRC is a deliberate simplification, not an oversight. The evidence across the codebase makes this clear through contrast.

The two Bitcask implementations tell different stories

The repo contains two separate Bitcask implementations at different fidelity levels:

1. log-structured-hash-table/bitcask.py — includes full CRC32 integrity checking. The header format at line 8 is "!III" (crc32, keysize, valuesize), it computes checksums via zlib.crc32 in writerecord (line 148), validates on read in get() (lines 172–180), and defines a dedicated CorruptionError exception (line 18). During recovery in scansegment (line 96), a CRC mismatch halts scanning — treating corruption as an end-of-valid-data marker.

2. hash-index-storage/bitcask.py — has no CRC at all. Its header format at line 10 is "<dII" (timestamp, keysize, valuesize). No import zlib, no checksum field, no integrity validation anywhere.
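
The two record layouts can be sketched with the header formats quoted above. This is an illustration, not code from the repo: the helper names encode_plain and encode_checked are hypothetical, and it assumes the CRC covers the size fields plus the key and value bytes, which the notes above do not confirm.

```python
import struct
import zlib

PLAIN_HEADER = "<dII"    # timestamp, keysize, valuesize (hash-index-storage)
CHECKED_HEADER = "!III"  # crc32, keysize, valuesize (log-structured-hash-table)


def encode_plain(key: bytes, value: bytes, timestamp: float) -> bytes:
    """Hypothetical encoder: header + key + value, no checksum anywhere."""
    header = struct.pack(PLAIN_HEADER, timestamp, len(key), len(value))
    return header + key + value


def encode_checked(key: bytes, value: bytes) -> bytes:
    """Hypothetical encoder: CRC32 over sizes + key + value (assumed coverage)."""
    sizes = struct.pack("!II", len(key), len(value))
    crc = zlib.crc32(sizes + key + value)
    return struct.pack("!I", crc) + sizes + key + value


plain = encode_plain(b"k", b"v", 0.0)
checked = encode_checked(b"k", b"v")

# The plain header is 16 bytes, the checked header 12 bytes.
print(struct.calcsize(PLAIN_HEADER), struct.calcsize(CHECKED_HEADER))
```

In the plain layout nothing in the header depends on the record's contents, so a reader has nothing to validate against.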

This isn't a case of someone forgetting one line. The entire integrity subsystem — the header field, the error class, the check-on-read path, the check-on-scan path — is absent. That's a design choice, not a missed detail.
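
To make the check-on-scan path concrete, here is a minimal sketch of a recovery scan that treats the first CRC mismatch as the end of valid data. It is not the repo's scansegment: the names make_record and scan_valid_prefix are hypothetical, and the CRC coverage (sizes + key + value) is an assumption.

```python
import struct
import zlib

HEADER = "!III"  # crc32, keysize, valuesize
HEADER_SIZE = struct.calcsize(HEADER)


def make_record(key: bytes, value: bytes) -> bytes:
    """Build one record with a CRC32 over sizes + key + value (assumed coverage)."""
    sizes = struct.pack("!II", len(key), len(value))
    return struct.pack("!I", zlib.crc32(sizes + key + value)) + sizes + key + value


def scan_valid_prefix(segment: bytes) -> dict:
    """Rebuild key -> (offset, length), stopping at the first bad record."""
    index = {}
    offset = 0
    while offset + HEADER_SIZE <= len(segment):
        crc, ksz, vsz = struct.unpack_from(HEADER, segment, offset)
        end = offset + HEADER_SIZE + ksz + vsz
        if end > len(segment):
            break  # truncated record: nothing valid past this point
        body = segment[offset + 4:end]  # everything after the crc field
        if zlib.crc32(body) != crc:
            break  # mismatch: treat as the end of valid data
        index[body[8:8 + ksz]] = (offset, end - offset)
        offset = end
    return index


first = make_record(b"a", b"1")
segment = first + make_record(b"b", b"2") + make_record(b"c", b"3")

# Flip one bit in the key of the second record.
damaged = bytearray(segment)
damaged[len(first) + HEADER_SIZE] ^= 0x01

print(sorted(scan_valid_prefix(segment)))
print(sorted(scan_valid_prefix(bytes(damaged))))
```

With the damaged segment, only the record before the corruption is indexed. A format without a checksum has no equivalent stopping condition.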

The repo stratifies implementations by concern

Other modules in the repo follow the same pattern: CRC appears where integrity is the *point* of the module.

The hash-index-storage/ module focuses on the hash-index data structure — the keydir, append-only log layout, compaction, hint files. CRC would be orthogonal to what it's demonstrating.

What happens on silent corruption

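Without CRC, silent corruption produces wrong answers with no error signal. For example, a flipped bit in a stored value is returned to the caller unchanged, and a damaged keysize or valuesize field can make the reader misjudge where a record ends.

The log-structured-hash-table variant handles these cases: scansegment (line 103) stops at the first CRC mismatch, and get() (line 178) raises CorruptionError rather than returning bad data.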
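
The contrast on the read path can be reproduced in a few lines. This is a sketch under stated assumptions, not the repo's get(): decode_plain, decode_checked, and flip_byte are hypothetical helpers, the CorruptionError class here is a local stand-in for the one named above, and the CRC is assumed to cover sizes + key + value.

```python
import struct
import zlib


class CorruptionError(Exception):
    """Local stand-in: stored checksum does not match the record contents."""


def decode_plain(record: bytes) -> tuple:
    # "<dII" layout: no checksum, so the bytes on disk are returned as-is.
    _ts, ksz, vsz = struct.unpack_from("<dII", record, 0)
    off = struct.calcsize("<dII")
    return record[off:off + ksz], record[off + ksz:off + ksz + vsz]


def decode_checked(record: bytes) -> tuple:
    # "!III" layout: verify the CRC before handing anything back.
    crc, ksz, vsz = struct.unpack_from("!III", record, 0)
    body = record[4:12 + ksz + vsz]  # sizes + key + value
    if zlib.crc32(body) != crc:
        raise CorruptionError("crc mismatch")
    return body[8:8 + ksz], body[8 + ksz:]


def flip_byte(data: bytes, index: int) -> bytes:
    """Simulate on-disk corruption by inverting one byte."""
    corrupted = bytearray(data)
    corrupted[index] ^= 0xFF
    return bytes(corrupted)


plain = struct.pack("<dII", 0.0, 3, 5) + b"key" + b"hello"
body = struct.pack("!II", 3, 5) + b"key" + b"hello"
checked = struct.pack("!I", zlib.crc32(body)) + body

# Corrupt the last byte of the value in both records.
bad_plain = flip_byte(plain, len(plain) - 1)
bad_checked = flip_byte(checked, len(checked) - 1)

print(decode_plain(bad_plain))  # wrong value comes back, no error raised
try:
    decode_checked(bad_checked)
except CorruptionError as exc:
    print("detected:", exc)
```

The plain decoder hands back a value that differs from what was written, and nothing in the record lets the caller notice.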

Summary

hash-index-storage/bitcask.py is a teaching implementation focused on the keydir/append-log/compaction architecture from DDIA Chapter 3. CRC was intentionally left out to keep the record format and code paths simple. For a version with integrity checking, the repo provides log-structured-hash-table/bitcask.py.

Beliefs