Function: scansegment in log-structured-hash-table/bitcask.py

Date: 2026-05-28

Time: 19:04

scansegment — Segment File Scanner for Index Recovery

Purpose

scansegment rebuilds the in-memory hash index by sequentially reading every record from a single segment file on disk. It's the crash-recovery path: when the store starts up (or when rebuild_index is called explicitly) and no hint file exists for a segment, this method replays the segment's write log to reconstruct which keys live where.

This is the slow path. The fast path is loadhintfile, which reads a pre-built index. scan_segment exists because hint files are optional — they're only created for frozen (compacted) segments, not the active one. On first startup after a crash, the active segment has no hint file, so the store must scan it record by record.

Contract

Preconditions:

Postconditions:

Invariants:

Parameters

| Parameter | Type | Description |

|-----------|------|-------------|

| self | BitcaskStore | The store instance; self._index is mutated |

| seg_path | str | Absolute path to a .dat segment file |

No validation is performed on segpath — if the file doesn't exist, the open() call raises FileNotFoundError. This is intentional: the callers (recover, rebuildindex) only pass paths discovered by findexistingsegments.

Return Value

None. The method communicates its results entirely through mutation of self._index.

Algorithm

1. Open the segment for binary reading.

2. Loop until the file is exhausted or a corrupt/incomplete record is found:

Side Effects

Error Handling

The method swallows all data integrity errors silently by breaking out of the loop:

No exception is raised, no warning is logged. The design philosophy is that a partial or corrupt record at the tail of a segment is expected after a crash — the process died mid-write. Everything before the bad record is assumed valid. This is safe because writes are append-only and flushed after each record (writerecord calls flush()).

However, if seg_path doesn't exist or isn't readable, open() raises a standard OS exception that propagates to the caller.

Usage Patterns

Called in two contexts:

1. Startup recovery (_recover): scans each segment that lacks a hint file, in segment-ID order.

2. Manual rebuild (rebuildindex): called after self.index.clear() to reconstruct from scratch.

Caller obligation: segments must be scanned in chronological order. If you scan segment 3 before segment 1, you'll get the wrong index — an older value for a key could overwrite the newer one.

Dependencies

| Dependency | Usage |

|------------|-------|

| struct (stdlib) | Unpacking the binary header format !III |

| zlib (stdlib) | CRC32 checksum computation |

| HEADERFMT, HEADERSIZE | Module-level constants defining the record header layout |

| TOMBSTONE | Sentinel value b"_BITCASKTOMBSTONE__" for deletion markers |

Assumptions Not Enforced by Types

1. Segment ordering is the caller's responsibility. Nothing prevents calling scansegment out of order, which would silently corrupt the index.

2. Keys are valid UTF-8. The .decode("utf-8") will raise UnicodeDecodeError if a key contains invalid bytes — this error is not caught and will propagate, aborting recovery.

3. Tombstone values are never legitimately stored. If a user calls put(key, b"_BITCASKTOMBSTONE__"), the key becomes invisible. There's no escaping mechanism.

4. Corruption only occurs at the tail. The break-on-CRC-mismatch strategy assumes corruption doesn't happen mid-file with valid records after it. A bit flip in the middle of a segment would cause all subsequent valid records to be silently dropped from the index.

Beliefs