Function: compact in log-structured-hash-table/bitcask.py

Date: 2026-05-28

Time: 18:57

BitcaskStore.compact — Log Segment Compaction

Purpose

compact is the garbage collection mechanism for this Bitcask store. Because the store is append-only, updates to a key don't overwrite the old value — they write a new record to the active segment. Deletions write a tombstone rather than removing anything. Over time, frozen (closed) segments accumulate stale records: superseded values and tombstoned keys. compact merges all frozen segments into a single new segment containing only the live records, then deletes the old segments to reclaim disk space.

This is the operation described in DDIA Chapter 3 as "compaction" for hash-indexed log-structured storage — the defining maintenance operation that makes append-only writes practical.

Contract

Preconditions:

Postconditions:

Invariant preserved: The active segment always has the highest segment ID on disk. This is critical for recovery (_recover assumes the last segment is active).

Parameters

None beyond self. The method operates on internal state.

Return Value

Returns int — the number of stale records removed (total records scanned minus live records written). A return of 0 with no frozen segments means there was nothing to compact. A return of 0 with frozen segments would mean every record in every frozen segment was still live (unlikely but possible).

Algorithm

The method runs in five phases:

Phase 1 — Scan frozen segments to find live entries:

Iterates every record in every frozen segment (oldest to newest). Builds a liveentries dict mapping each key to its most recent (segmentpath, offset). Tombstones remove keys from liveentries. The oldest-to-newest ordering matters: later writes for the same key overwrite earlier ones, so the final state of liveentries reflects the most recent value per key across all frozen segments.

Phase 2 — Filter against the authoritative index:

Not every entry in liveentries is truly live. A key might have been overwritten *in the active segment* after the frozen segment was closed. The filter self.index[key] == (seg_path, offset) checks whether the in-memory index still points to this frozen-segment record. If the index points elsewhere (e.g., the active segment), the frozen record is stale and excluded.

Phase 3 — Write compacted segment:

Closes the active file, bumps the segment counter, and opens a new compacted segment file. For each entry to keep, it re-reads the value from the original segment, recomputes the CRC, and writes a fresh record to the compacted segment. It builds newindexentries with the new offsets.

Phase 4 — Clean up old segments:

Updates _index with the new locations, closes cached read handles for old segments, and deletes the old frozen segment files along with any associated hint files.

Phase 5 — Restore the active segment with the highest ID:

Bumps the segment counter again, renames the old active segment file to the new highest ID, updates all index entries that pointed to the old active path, and reopens the active file for appending. This ensures the active segment has a higher ID than the compacted segment, preserving the recovery invariant.

Side Effects

This method is heavy on side effects:

Error Handling

There is essentially none. If any open, read, write, os.remove, or os.rename call fails, the exception propagates uncaught. This leaves the store in a partially compacted state:

The CRC in the compacted segment is recomputed fresh, so compaction doesn't propagate corrupted data — but it also doesn't *verify* the CRC of records it reads from old segments during compaction (unlike scansegment, which does).

Usage Patterns

Called in two ways:

1. Automatically from put() when the number of frozen segments reaches autocompact_threshold (default 5).

2. Manually by the caller for explicit maintenance.

Caller obligations: don't call compact concurrently with reads or writes. The method temporarily closes the active file, so any interleaved put or get during compaction would fail or corrupt state.

Dependencies

Assumptions Not Enforced by Types

1. Single-threaded access — no locks protect index, active_file, or the segment counter during the multi-step mutation sequence.

2. Segment ordering — the scan assumes frozen segments are returned in creation order so that later writes correctly shadow earlier ones. frozensegmentpaths delegates to findexistingsegments which sorts by ID, so this holds as long as IDs increase monotonically.

3. No external modification — the method assumes no other process modifies segment files on disk between the scan and the delete.

4. Active segment not includedfrozensegmentpaths excludes the active path by identity comparison (path != self.active_path). If the active path string doesn't exactly match what's on disk (e.g., symlinks, relative vs absolute), frozen segments could be misidentified.

5. CRC not verified on read during compaction — unlike scansegment, the compaction scan doesn't check CRCs, so a corrupted record in a frozen segment could be silently copied into the compacted segment.