Original source: topic-tombstone-handling-in-hint-files

Topic: The `log-structured-hash-table` hint format has no way to represent tombstones (it's just key_size + offset), so deleted keys from compaction may also have subtle semantics worth tracing

Date: 2026-05-29

Time: 08:53

The Hint File Tombstone Gap in `log-structured-hash-table`

The Core Problem

The hint file format at log-structured-hash-table/bitcask.py:13-14 stores only two fields per entry:


HINT_ENTRY_FMT = "!II"  # key_size, offset

Meanwhile, tombstones are represented as a magic value in the data file (bitcask.py:9):


TOMBSTONE = b"__BITCASK_TOMBSTONE__"

The hint format has no field to distinguish "this offset points to a live record" from "this offset points to a tombstone." This creates a divergence between two recovery paths that should produce identical results.

The Two Recovery Paths

Path A — Scanning the data file (scansegment, lines 96–114) correctly handles tombstones:


if value == TOMBSTONE:
    self._index.pop(key, None)   # deleted key is removed from the index
else:
    self._index[key] = (seg_path, offset)

Path B — Loading a hint file (loadhint_file, lines 116–127) blindly trusts every entry:


key = key_bytes.decode("utf-8")
self._index[key] = (seg_path, offset)   # always adds — no tombstone check possible

These two paths are selected during recovery (_recover, lines 82–96): if a .hint file exists for a segment, it's used; otherwise the segment is scanned. The intent is that both paths produce the same index. For tombstones, they don't.

How Deleted Keys Resurrect

Consider this sequence:

1. Segment 1 contains put("user", b"Alice")

2. Segment 2 contains delete("user") — writes TOMBSTONE value

3. Hint files are generated for both segments

On recovery:

Segment 1's hint file adds user → (seg1, offset) to the index
Segment 2's hint file also adds user → (seg2, tombstone_offset) to the index, overwriting segment 1's entry

Now get("user") reads the record at tombstoneoffset, finds the payload is b"BITCASKTOMBSTONE_", and returns it as the value. The get method (lines 164–179) performs no tombstone check — it returns payload[keysize:] unconditionally. A deleted key has come back to life, and worse, its value is the raw tombstone sentinel bytes.

Had scansegment processed segment 2 instead, it would have called self._index.pop("user", None), correctly removing the key.

Why Compaction Masks This (Usually)

This bug is likely latent rather than actively triggered, because compaction should strip tombstones from the output segment — dead keys simply aren't written to the merged file. If createhintfiles() is only called on compacted segments (as the test at lines 104–113 suggests), the hint files would never reference tombstone records.

But the code offers no structural guarantee of this. createhintfiles() and compact() are independent public methods. Calling createhintfiles() on an uncompacted segment that contains deletions would produce a hint file with phantom entries.

Contrast with `hash-index-storage`

The sibling implementation at hash-index-storage/bitcask.py has the same structural gap — its loadhint_file (lines ~148–160) also unconditionally adds entries. But it has a defense-in-depth layer: its get() method (lines ~175–182) checks if value == "": return None, so even if a tombstone leaks into the index through a hint file, reads return None rather than sentinel bytes. The log-structured-hash-table implementation lacks this second guard.

What's Missing from the Observations

The compact() and createhintfiles() methods (likely beyond line 200 of bitcask.py) weren't included. Whether compaction writes tombstones to the output segment — and whether createhintfiles iterates the index or scans the segment — determines whether this bug is practically reachable or only theoretical.

Topic: The log-structured-hash-table hint format has no way to represent tombstones (it's just key_size + offset), so deleted keys from compaction may also have subtle semantics worth tracing

The Hint File Tombstone Gap in log-structured-hash-table