HEADER_FMT, CRC, key-length, value-length, payload) is central to tests 4 and 6; understanding the exact byte layout explains the magic offsetsDate: 2026-05-29
Time: 08:34
Every record in the log-structured hash table is a fixed-size header followed by a variable-size payload. The format is defined at log-structured-hash-table/bitcask.py:10-11:
HEADER_FMT = "!III" # crc32, key_size, value_size
HEADER_SIZE = struct.calcsize(HEADER_FMT) # = 12 bytes
The !III format string means: network byte order (big-endian), three unsigned 32-bit integers. Each I is 4 bytes, so the header is always exactly 12 bytes. A single record on disk looks like this:
0 4 8 12 12+K 12+K+V
├─────────┼─────────┼────────┼─────────┼─────────┤
│ CRC32 │ key_len │val_len │ key │ value │
│ 4 bytes │ 4 bytes │4 bytes │ K bytes │ V bytes │
└─────────┴─────────┴────────┴─────────┴─────────┘
◄──────── HEADER ─────────────►◄───── PAYLOAD ────►
(12 bytes fixed) (variable)
Total record size = 12 + len(keybytes) + len(valuebytes).
Writing (bitcask.py:138-145): The CRC is computed over the payload only (key bytes concatenated with value bytes), not over the header. The header is then prepended and the whole thing is written atomically:
payload = key_bytes + value
crc = zlib.crc32(payload) & 0xFFFFFFFF
header = struct.pack(HEADER_FMT, crc, len(key_bytes), len(value))
self._active_file.write(header + payload)
Reading (bitcask.py:170-185): The reader seeks to the offset stored in the in-memory index, reads exactly 12 bytes of header to learn the sizes, reads keysize + valuesize bytes of payload, then recomputes the CRC and compares:
crc, key_size, value_size = struct.unpack(HEADER_FMT, header)
payload = f.read(key_size + value_size)
expected_crc = zlib.crc32(payload) & 0xFFFFFFFF
if crc != expected_crc:
raise CorruptionError(...)
test_bitcask.py:53-58)
def test_segment_rotation(store):
for i in range(20):
store.put(f"key{i}", b"x" * 10)
assert store.num_segments >= 2
The fixture creates a store with maxsegmentsize=100. Each record is:
| Component | Size |
|-----------|------|
| Header | 12 bytes |
| Key ("key0" .. "key9") | 4 bytes |
| Value (b"x" * 10) | 10 bytes |
| Total per record | 26 bytes |
After 3 records: 78 bytes (under 100). After 4 records: 104 bytes (over 100). So rotation triggers on the 4th write. The threshold of 100 isn't arbitrary — it's chosen so that 12 + 4 + 10 = 26 bytes per record guarantees rotation within a handful of writes. Without knowing the 12-byte header overhead, 100 looks like a magic number; with it, it's obviously ~3.8 records per segment.
test_bitcask.py:82-91)
for i in range(10):
store.put("key", f"value_{i}".encode()) # 10 overwrites = 9 stale
for i in range(10):
store.put(f"other{i}", b"data")
Again maxsegmentsize=100. The first loop writes "key" (3 bytes) + "value0" through "value9" (7-8 bytes each), so each record is 12 + 3 + 7 = 22 bytes (or 23 for two-digit indices). About 4 records fit per segment before rotation. All 10 overwrites of "key" create 9 stale records spread across multiple segments. The second loop adds more records with similar sizing. After compaction, only the latest "key" entry and the 10 "other*" entries survive — the 9 stale records (each ~22 bytes = ~198 bytes of dead data) are eliminated, explaining the measurable drop in totaldiskusage.
test_bitcask.py:131)The "magic offset" here is explicit:
f.seek(offset + HEADER_SIZE + 2) # inside the key/value payload
offset + HEADER_SIZE lands exactly at the first byte of the payload (start of the key). Adding 2 puts you inside the key/value data. Since the CRC covers the payload but not the header, flipping any byte past offset + 12 invalidates the checksum while leaving the header (and its recorded CRC) untouched — guaranteeing a CorruptionError on the next get().
The CRC is computed over keybytes + valuebytes but not the header fields (keysize, valuesize). This means a corrupted size field wouldn't be caught by the CRC — instead, it would cause a wrong-length read, which would almost certainly produce a CRC mismatch anyway (reading the wrong bytes). This is a pragmatic simplification common in Bitcask-style implementations: the header is small and fixed, so corruption there manifests as garbage sizes that either fail to read or fail the CRC check on the misaligned payload.