Date: 2026-05-29
Time: 06:27
This codebase uses two distinct integrity strategies, and the split is deliberate — it mirrors the threat models of the systems being implemented.
Three storage engines use zlib.crc32 to detect accidental corruption — bit flips, partial writes, torn pages:
write-ahead-log/wal.py:31) — CRC covers op_type + key + value, stored alongside each record. On read (wal.py:54), the CRC is recomputed and compared; a mismatch raises ValueError at line 56.log-structured-hash-table/bitcask.py:142) — CRC32 is the first field in every record header (bitcask.py:10: HEADERFMT = "!III" # crc32, keysize, value_size). Reads validate at bitcask.py:99 and bitcask.py:182.b-tree-storage-engine/btree.py:176) — A _checksum method wraps zlib.crc32, used to protect WAL entries for page-level recovery (btree.py:133, verified at btree.py:164).CRC32 is the right tool here. These are single-node storage engines where the threat is hardware — a flipped bit on disk, a kernel that didn't flush, a crash mid-write. CRC32 is fast (hardware-accelerated on most CPUs), adds only 4 bytes per record, and catches random corruption with high reliability.
But CRC32 is not cryptographic. It's a linear function — given a message and its CRC, an attacker can compute the CRC of a modified message without knowing the original. This means:
For a single-process storage engine running on a trusted local disk, none of this matters. The adversary model is "cosmic rays and power failures," not "someone editing my data files."
The Merkle tree (merkle-tree/merkle_tree.py:11-12) uses SHA-256 for a fundamentally different reason:
def _sha256(data: bytes) -> str:
return hashlib.sha256(data).hexdigest()
Merkle trees exist for anti-entropy — comparing data across replicas that may be on different machines, controlled by different processes, separated by a network. The threat model shifts from accidental corruption to:
1. Untrusted replicas — In a distributed system, you can't assume the other node is honest. A replica could have been compromised, or could be running buggy software that silently corrupts data.
2. Tamper-evident proofs — verifyproof at merkletree.py:169 lets a client verify that a specific data block is part of a tree *without downloading the entire dataset*. This only works if the hash function is collision-resistant — if an attacker could find two blocks with the same hash, the proof would be meaningless.
3. Cross-network verification — When hashes travel over a network, an adversary could intercept and modify them. SHA-256's preimage resistance makes this computationally infeasible.
Two things stand out:
The SSTable implementation (sstable-and-compaction/sstable.py) has no integrity checks at all — no CRC, no hash. It validates the magic bytes (MAGIC = b"SSTB" at line 9, checked at line 117), but that only catches "this isn't an SSTable file," not "this SSTable has been corrupted." A single bit flip in a value would be silently returned. Real SSTable implementations (LevelDB, RocksDB) include per-block CRC32 checksums for exactly this reason.
The hash-index Bitcask variant (hash-index-storage/bitcask.py) also lacks CRC protection, unlike its sibling in log-structured-hash-table/bitcask.py. It uses a simpler HEADERFORMAT = "<dII" (timestamp, keysize, value_size) with no checksum field. A corrupted record would be read without error.
You'd switch from CRC32 to a cryptographic hash when the trust boundary changes:
| Scenario | CRC32 | SHA-256 |
|----------|-------|---------|
| Local WAL on a trusted disk | Sufficient | Overkill |
| Replicated WAL across nodes | Insufficient | Needed |
| SSTable served by a trusted process | Sufficient (if present) | Overkill |
| SSTable downloaded from a remote replica | Insufficient | Needed |
| B-tree page cache on local disk | Sufficient | Overkill |
| Backup verification after network transfer | Insufficient | Needed |
The pattern: CRC32 protects against accidents on a trusted path. SHA-256 protects against malice or distrust across a boundary. The codebase already demonstrates both — it just applies them to different components based on the appropriate threat model.
merkle-tree/merkletree.py:verifyproof — How Merkle inclusion proofs work and why collision resistance matters for their security guaranteesstable-and-compaction/sstable.py — Notable absence of any integrity checksums; compare with LevelDB's block-level CRC designhash-index-storage/bitcask.py — Compare this CRC-less Bitcask variant with the checksummed version in log-structured-hash-table/bitcask.pybyzantine-fault-tolerance — How PBFT uses compute_digest to handle integrity in a fully adversarial environment where nodes may liewrite-ahead-log/wal.py:encoderecord — Study what exactly the CRC covers (optype + key + value but not seqnum) and why that choice matters for corruption detectionwal-crc-covers-payload-not-header — The WAL CRC at wal.py:31 checksums optype + key + value but not the sequence number or record length, meaning header corruption (e.g., wrong seqnum) would go undetectedsstable-has-no-integrity-check — sstable.py validates the magic bytes but has no per-entry or per-block checksum, so data corruption in SSTable files would be silently returned to the callermerkle-tree-uses-sha256-for-cross-replica-verification — The Merkle tree uses SHA-256 specifically because it supports tamper-evident proofs (verify_proof) across trust boundaries, a property CRC32 cannot providehash-index-bitcask-lacks-crc — hash-index-storage/bitcask.py has no CRC field in its header format (<dII), unlike log-structured-hash-table/bitcask.py which includes CRC32 as the first header field (!III)