Function: loadhint_file in hash-index-storage/bitcask.py

Date: 2026-05-29

Time: 06:49

loadhint_file — Fast Index Recovery from Hint Files

Purpose

loadhintfile reconstructs the in-memory key index (self.keydir) from a precomputed hint file instead of scanning the full data file record-by-record. Hint files are a Bitcask optimization: during compaction, a compact summary of each record's location metadata is written alongside the merged data file. On startup, loading a hint file is O(index entries) with minimal I/O, whereas scandatafile must read and skip over every value payload — potentially gigabytes of data — just to extract the same key-to-location mappings.

Contract

Parameters

| Parameter | Type | Description |

|-----------|------|-------------|

| self | BitcaskStore | The store instance whose keydir will be populated |

| fileid | int | Numeric ID of the data file whose hint file to load. Maps to {fileid}.hint on disk. |

Return Value

None. The method mutates self.keydir as a side effect.

Algorithm

1. Bulk read: Opens the hint file and reads the entire contents into memory as a single bytes object. This is an intentional choice — hint files are small (they contain no value data), so buffering the whole thing avoids repeated syscalls.

2. Iterate fixed-header records: Walks through the byte buffer using a manual position cursor pos:

a. Unpack the fixed header (24 bytes, HINT_FORMAT = "<IQId"):

b. Read the variable-length key: Unpacks a 4-byte little-endian uint32 for key_size, then reads that many bytes and decodes as UTF-8.

c. Insert into keydir: Creates a KeyEntry(fid, offset, size, ts) and stores it under the decoded key. If the key already exists (from an earlier file's hint), it is silently overwritten — this is correct because rebuildindex processes files in sorted ID order, so later files contain newer data.

Side Effects

Error Handling

None. The method will raise:

No corruption detection (checksums, magic bytes) is performed. The method trusts that the hint file is well-formed.

Usage Patterns

Called exclusively by rebuildindex during store initialization:


def _rebuild_index(self, file_ids):
    for fid in sorted(file_ids):
        if os.path.exists(self._hint_path(fid)):
            self._load_hint_file(fid)    # fast path
        else:
            self._scan_data_file(fid)    # slow path

The caller obligation is simple: only call this when a .hint file is known to exist. The symmetric writer is writehint_file, called during compaction to produce the hint files that this method later consumes.

Dependencies

Notable Assumptions

1. Hint file format includes a key-size field not covered by HINTFORMAT: The 4-byte keysize is packed separately from the fixed header. This means each hint record is actually HINTHEADERSIZE + 4 + len(key_bytes) — a variable-length format with a two-part header. This asymmetry between the format constant and actual layout is easy to miss.

2. No tombstone handling: Unlike scandatafile which checks valsize == 0 to identify tombstones and removes keys from keydir, hint files are only produced during compaction — and compaction already filters out tombstones. So hint files are assumed to contain only live records.

3. fid in the hint record may differ from the file_id parameter: The hint record stores its own fid field (the data file ID where the record lives). In practice after compaction these should match, but the code trusts the embedded fid over the parameter — which is the correct behavior if a hint file ever mapped across data files.

4. UTF-8 keys only: Keys are decoded with .decode("utf-8") with no error handling. Binary keys or other encodings will raise.

Beliefs