Date: 2026-05-28
Time: 19:04
scansegment — Segment File Scanner for Index Recoveryscansegment rebuilds the in-memory hash index by sequentially reading every record from a single segment file on disk. It's the crash-recovery path: when the store starts up (or when rebuild_index is called explicitly) and no hint file exists for a segment, this method replays the segment's write log to reconstruct which keys live where.
This is the slow path. The fast path is loadhintfile, which reads a pre-built index. scan_segment exists because hint files are optional — they're only created for frozen (compacted) segments, not the active one. On first startup after a crash, the active segment has no hint file, so the store must scan it record by record.
Preconditions:
segpath must point to a valid, readable file on disk containing zero or more Bitcask records in the binary format [crc32 | keysize | valuesize | keybytes | value_bytes].self.index must already exist (initialized by init_)._recover) guarantees this ordering.Postconditions:
self.index to (segpath, byte_offset).self._index entirely — including entries that may have been placed there by scanning earlier segments.with block).Invariants:
| Parameter | Type | Description |
|-----------|------|-------------|
| self | BitcaskStore | The store instance; self._index is mutated |
| seg_path | str | Absolute path to a .dat segment file |
No validation is performed on segpath — if the file doesn't exist, the open() call raises FileNotFoundError. This is intentional: the callers (recover, rebuildindex) only pass paths discovered by findexistingsegments.
None. The method communicates its results entirely through mutation of self._index.
1. Open the segment for binary reading.
2. Loop until the file is exhausted or a corrupt/incomplete record is found:
f.tell() — this is the byte position that will go into the index.crc, keysize, valuesize.keysize + valuesize bytes). If the read is short, the record was partially written (crash mid-write); stop scanning.key_size bytes of the payload (UTF-8).b"_BITCASKTOMBSTONE_", remove the key from the index. Otherwise, insert/overwrite the index entry with (segpath, offset).self._index: adds, updates, or removes entries. This is the whole point.The method swallows all data integrity errors silently by breaking out of the loop:
No exception is raised, no warning is logged. The design philosophy is that a partial or corrupt record at the tail of a segment is expected after a crash — the process died mid-write. Everything before the bad record is assumed valid. This is safe because writes are append-only and flushed after each record (writerecord calls flush()).
However, if seg_path doesn't exist or isn't readable, open() raises a standard OS exception that propagates to the caller.
Called in two contexts:
1. Startup recovery (_recover): scans each segment that lacks a hint file, in segment-ID order.
2. Manual rebuild (rebuildindex): called after self.index.clear() to reconstruct from scratch.
Caller obligation: segments must be scanned in chronological order. If you scan segment 3 before segment 1, you'll get the wrong index — an older value for a key could overwrite the newer one.
| Dependency | Usage |
|------------|-------|
| struct (stdlib) | Unpacking the binary header format !III |
| zlib (stdlib) | CRC32 checksum computation |
| HEADERFMT, HEADERSIZE | Module-level constants defining the record header layout |
| TOMBSTONE | Sentinel value b"_BITCASKTOMBSTONE__" for deletion markers |
1. Segment ordering is the caller's responsibility. Nothing prevents calling scansegment out of order, which would silently corrupt the index.
2. Keys are valid UTF-8. The .decode("utf-8") will raise UnicodeDecodeError if a key contains invalid bytes — this error is not caught and will propagate, aborting recovery.
3. Tombstone values are never legitimately stored. If a user calls put(key, b"_BITCASKTOMBSTONE__"), the key becomes invisible. There's no escaping mechanism.
4. Corruption only occurs at the tail. The break-on-CRC-mismatch strategy assumes corruption doesn't happen mid-file with valid records after it. A bit flip in the middle of a segment would cause all subsequent valid records to be silently dropped from the index.
scan-segment-ordering-invariant — scansegment must be called on segments in ascending segment-ID order for the index to correctly reflect last-write-wins semantics; no runtime check enforces this.corruption-terminates-scan — A CRC mismatch or short read stops scanning the entire segment; records after the corrupt point are silently lost from the index, even if they are individually valid.tombstone-removes-cross-segment — A tombstone record in segment N removes the key from the index even if the key's live value was written in an earlier segment, because the index is a flat dict shared across all scans.scan-segment-is-read-only — scansegment never modifies any file on disk; its only side effect is mutating self._index.utf8-decode-unguarded — If a segment contains a key with invalid UTF-8 bytes, scansegment raises an unhandled UnicodeDecodeError, aborting recovery for all subsequent segments.