Function: loadhint_file in log-structured-hash-table/bitcask.py

Date: 2026-05-29

Time: 07:26

loadhint_file — Fast Index Recovery from Precomputed Summaries

Purpose

loadhint_file reconstructs the in-memory hash index from a hint file instead of scanning the full segment data file. This is a startup optimization straight from the Bitcask paper: hint files contain only the keys and their offsets, stripping out the values entirely. For a segment with large values, this can reduce recovery I/O by orders of magnitude — you read kilobytes of hints instead of megabytes of data.

It exists because the alternative — scansegment — must read every byte of every record (key + value + CRC) just to extract the key and offset. Hint files are a precomputed projection of exactly the information the index needs.

Contract

Preconditions:

Postconditions:

Invariant assumed but not enforced: The hint file is consistent with the segment file. If the hint file is corrupt or out of sync, the index will contain garbage offsets that will cause CorruptionError on read.

Parameters

| Parameter | Type | Meaning |

|-----------|------|---------|

| hint_path | str | Absolute path to the .hint file to read |

| seg_path | str | Absolute path to the corresponding .dat segment — not opened here, just stored as the index value |

Neither parameter is validated. A nonexistent hintpath raises FileNotFoundError. A wrong segpath silently corrupts the index.

Return Value

None. All output is via mutation of self._index.

Algorithm

1. Open the hint file in binary read mode.

2. Loop: read a fixed-size header (HINTHEADERSIZE = 8 bytes, two u32 network-order integers).

3. Unpack the header into key_size (length of the key in bytes) and offset (byte position of the full record in the segment file).

4. Read key_size bytes for the key. If a short read occurs, break — the file is truncated or done.

5. Decode the key from UTF-8 and insert (segpath, offset) into self.index.

6. Repeat until a header read returns fewer than HINTHEADERSIZE bytes.

The critical thing: offset here is the offset in the segment file, not in the hint file. When get() later uses this offset, it seeks to that position in seg_path and reads the full record (header + key + value + CRC check).

Side Effects

Error Handling

Errors are handled via silent truncation, not exceptions:

Usage Patterns

Called from exactly two sites:

1. recover() (line ~73) — during startup, for each segment that has a .hint file alongside it. Falls back to scan_segment if no hint file exists.

2. rebuild_index() (line ~230) — explicit full index rebuild, same hint-or-scan logic.

In both cases, the caller iterates segments in sorted order (by segment ID), so last-write-wins semantics in the index produce the correct result.

The hint files themselves are created by createhintfiles(), which only writes entries for non-tombstoned keys whose index still points to that segment — so loadhint_file never sees tombstones and unconditionally inserts every entry it reads.

Dependencies

Notable Assumptions

1. Keys are valid UTF-8. key_bytes.decode("utf-8") will raise UnicodeDecodeError on malformed data — this is the one failure mode that is not silently swallowed.

2. Hint files are trustworthy. No integrity check means a corrupted hint file is worse than a missing one — missing triggers a full scan with CRC checks, while a corrupted hint silently loads bad data.

3. Offsets fit in 32 bits. The !II format caps segment files at 4 GiB. This matches the maxsegmentsize default of 1 MiB, but nothing prevents a caller from setting a larger size.

4. No tombstone entries in hint files. loadhintfile does not check for tombstones (unlike scansegment). This is correct only because createhint_files filters them out at write time — the correctness of this reader depends on the writer's behavior.