log-structured-hash-table hint format has no way to represent tombstones (it's just key_size + offset), so deleted keys from compaction may also have subtle semantics worth tracingDate: 2026-05-29
Time: 08:53
log-structured-hash-tableThe hint file format at log-structured-hash-table/bitcask.py:13-14 stores only two fields per entry:
HINT_ENTRY_FMT = "!II" # key_size, offset
Meanwhile, tombstones are represented as a magic value in the data file (bitcask.py:9):
TOMBSTONE = b"__BITCASK_TOMBSTONE__"
The hint format has no field to distinguish "this offset points to a live record" from "this offset points to a tombstone." This creates a divergence between two recovery paths that should produce identical results.
Path A — Scanning the data file (scansegment, lines 96–114) correctly handles tombstones:
if value == TOMBSTONE:
self._index.pop(key, None) # deleted key is removed from the index
else:
self._index[key] = (seg_path, offset)
Path B — Loading a hint file (loadhint_file, lines 116–127) blindly trusts every entry:
key = key_bytes.decode("utf-8")
self._index[key] = (seg_path, offset) # always adds — no tombstone check possible
These two paths are selected during recovery (_recover, lines 82–96): if a .hint file exists for a segment, it's used; otherwise the segment is scanned. The intent is that both paths produce the same index. For tombstones, they don't.
Consider this sequence:
1. Segment 1 contains put("user", b"Alice")
2. Segment 2 contains delete("user") — writes TOMBSTONE value
3. Hint files are generated for both segments
On recovery:
user → (seg1, offset) to the indexuser → (seg2, tombstone_offset) to the index, overwriting segment 1's entryNow get("user") reads the record at tombstoneoffset, finds the payload is b"BITCASKTOMBSTONE_", and returns it as the value. The get method (lines 164–179) performs no tombstone check — it returns payload[keysize:] unconditionally. A deleted key has come back to life, and worse, its value is the raw tombstone sentinel bytes.
Had scansegment processed segment 2 instead, it would have called self._index.pop("user", None), correctly removing the key.
This bug is likely latent rather than actively triggered, because compaction should strip tombstones from the output segment — dead keys simply aren't written to the merged file. If createhintfiles() is only called on compacted segments (as the test at lines 104–113 suggests), the hint files would never reference tombstone records.
But the code offers no structural guarantee of this. createhintfiles() and compact() are independent public methods. Calling createhintfiles() on an uncompacted segment that contains deletions would produce a hint file with phantom entries.
hash-index-storageThe sibling implementation at hash-index-storage/bitcask.py has the same structural gap — its loadhint_file (lines ~148–160) also unconditionally adds entries. But it has a defense-in-depth layer: its get() method (lines ~175–182) checks if value == "": return None, so even if a tombstone leaks into the index through a hint file, reads return None rather than sentinel bytes. The log-structured-hash-table implementation lacks this second guard.
The compact() and createhintfiles() methods (likely beyond line 200 of bitcask.py) weren't included. Whether compaction writes tombstones to the output segment — and whether createhintfiles iterates the index or scans the segment — determines whether this bug is practically reachable or only theoretical.