Date: 2026-05-29
Time: 07:26
loadhint_file — Fast Index Recovery from Precomputed Summariesloadhint_file reconstructs the in-memory hash index from a hint file instead of scanning the full segment data file. This is a startup optimization straight from the Bitcask paper: hint files contain only the keys and their offsets, stripping out the values entirely. For a segment with large values, this can reduce recovery I/O by orders of magnitude — you read kilobytes of hints instead of megabytes of data.
It exists because the alternative — scansegment — must read every byte of every record (key + value + CRC) just to extract the key and offset. Hint files are a precomputed projection of exactly the information the index needs.
Preconditions:
hintpath points to an existing, readable file in the hint format ([keysize: u32, offset: u32, key_bytes: var] repeated)seg_path is the corresponding .dat segment file — the offsets stored in the hint file are byte positions within this segmentcreatehintfiles)Postconditions:
self.index contains a (segpath, offset) entry for every key found in the hint file_recover processes segments oldest-to-newest)Invariant assumed but not enforced: The hint file is consistent with the segment file. If the hint file is corrupt or out of sync, the index will contain garbage offsets that will cause CorruptionError on read.
| Parameter | Type | Meaning |
|-----------|------|---------|
| hint_path | str | Absolute path to the .hint file to read |
| seg_path | str | Absolute path to the corresponding .dat segment — not opened here, just stored as the index value |
Neither parameter is validated. A nonexistent hintpath raises FileNotFoundError. A wrong segpath silently corrupts the index.
None. All output is via mutation of self._index.
1. Open the hint file in binary read mode.
2. Loop: read a fixed-size header (HINTHEADERSIZE = 8 bytes, two u32 network-order integers).
3. Unpack the header into key_size (length of the key in bytes) and offset (byte position of the full record in the segment file).
4. Read key_size bytes for the key. If a short read occurs, break — the file is truncated or done.
5. Decode the key from UTF-8 and insert (segpath, offset) into self.index.
6. Repeat until a header read returns fewer than HINTHEADERSIZE bytes.
The critical thing: offset here is the offset in the segment file, not in the hint file. When get() later uses this offset, it seeks to that position in seg_path and reads the full record (header + key + value + CRC check).
self._index — overwrites entries for any keys already present. This is intentional: during recovery, segments are processed oldest-first, so newer segments' entries correctly shadow older ones.with block, so the hint file handle is closed on exit.Errors are handled via silent truncation, not exceptions:
break out of the loop. This treats any partial write at the tail as end-of-file rather than corruption.CorruptionError when get() reads the segment and the CRC check fails.keysize is absurdly large) will cause a large allocation in f.read(keysize), potentially an OverflowError or MemoryError. No bounds checking.Called from exactly two sites:
1. recover() (line ~73) — during startup, for each segment that has a .hint file alongside it. Falls back to scan_segment if no hint file exists.
2. rebuild_index() (line ~230) — explicit full index rebuild, same hint-or-scan logic.
In both cases, the caller iterates segments in sorted order (by segment ID), so last-write-wins semantics in the index produce the correct result.
The hint files themselves are created by createhintfiles(), which only writes entries for non-tombstoned keys whose index still points to that segment — so loadhint_file never sees tombstones and unconditionally inserts every entry it reads.
struct — binary unpacking with HINTENTRYFMT = "!II" (network byte order, two unsigned 32-bit ints)HINTHEADERSIZE — precomputed as struct.calcsize(HINTENTRYFMT) = 8 bytesself._index — the in-memory dict[str, tuple[str, int]] that is the entire point of Bitcask's design1. Keys are valid UTF-8. key_bytes.decode("utf-8") will raise UnicodeDecodeError on malformed data — this is the one failure mode that is not silently swallowed.
2. Hint files are trustworthy. No integrity check means a corrupted hint file is worse than a missing one — missing triggers a full scan with CRC checks, while a corrupted hint silently loads bad data.
3. Offsets fit in 32 bits. The !II format caps segment files at 4 GiB. This matches the maxsegmentsize default of 1 MiB, but nothing prevents a caller from setting a larger size.
4. No tombstone entries in hint files. loadhintfile does not check for tombstones (unlike scansegment). This is correct only because createhint_files filters them out at write time — the correctness of this reader depends on the writer's behavior.