Function: rebuildindex in hash-index-storage/bitcask.py

Date: 2026-05-28

Time: 18:57

BitcaskStore.rebuildindex

Purpose

This is the startup recovery method for the Bitcask store. When a BitcaskStore is opened against an existing data directory, rebuildindex reconstructs the in-memory hash index (self.keydir) by replaying all data files from disk. This is necessary because Bitcask's entire read path depends on the in-memory index — without it, the store has no way to locate any key's value on disk.

It exists because Bitcask trades startup cost for runtime simplicity: the index is never persisted as a standalone structure (except via optional hint files), so it must be rebuilt on every open.

Contract

Preconditions:

Postconditions:

Invariant: After completion, for every key k in self.keydir, keydir[k] points to the most recent non-tombstone record for k across all scanned files.

Parameters

| Parameter | Type | Description |

|-----------|------|-------------|

| file_ids | list[int] | Numeric IDs of data files to scan (e.g., [0, 1, 3] corresponding to 0.data, 1.data, 3.data). Gaps are fine. The list does not need to be pre-sorted — the method sorts it internally. |

Edge cases:

Return Value

None. The method mutates self.keydir in place.

Algorithm

1. Sort file IDssorted(file_ids) ensures files are processed oldest-first. This is critical: if key "x" appears in both file 0 and file 2, file 2's entry overwrites file 0's, leaving the latest value in the index.

2. For each file, choose the fast or slow path:

3. The method itself contains no merging logic — ordering guarantees correctness. Each delegate (loadhintfile or scandatafile) writes directly into self.keydir, and later files naturally overwrite stale entries from earlier files.

Side Effects

Error Handling

There is no explicit error handling. The method will propagate:

None of these are caught — a corrupt file means a crash at startup. This is a deliberate design choice for a reference implementation: fail loudly rather than silently lose data.

Usage Patterns

Called exactly once, from _init_, and only when existing data files are found:


existing_ids = self._find_file_ids()
if existing_ids:
    self._rebuild_index(existing_ids)

The caller then sets self.activefileid = max(existing_ids) — the rebuild must complete before the active file is opened for writes. There's no locking; this assumes single-process access (standard Bitcask constraint).

Dependencies

| Dependency | Role |

|------------|------|

| self.hintpath(fid) | Resolves <data_dir>/<fid>.hint |

| self.loadhint_file(fid) | Fast-path index loading from precomputed hint files |

| self.scandata_file(fid) | Slow-path sequential scan of raw data files |

| os.path.exists | Checks for hint file presence |

| self.keydir | Target dict being populated |

Assumptions Not Enforced by Types

1. File IDs correspond to actual files on disk. Nothing prevents passing an ID with no matching .data file — it will crash inside scandata_file.

2. Hint files are consistent with data files. If a hint file is stale or corrupt, the index will be wrong, but no validation is performed.

3. Tombstone handling differs between paths. scandatafile removes tombstoned keys from keydir; loadhintfile does not — hint files are only written during compaction, which already strips tombstones. If a hint file somehow contained a tombstoned key, it would appear as live.

4. Single-writer assumption. No file locking or CRC checks — concurrent writers would corrupt the index silently.