Function: compact in hash-index-storage/bitcask.py

Date: 2026-05-28

Time: 18:22

BitcaskStore.compact

Purpose

compact is the garbage collection mechanism for a Bitcask append-only storage engine. Because Bitcask never overwrites data in place — every put and delete appends a new record — old data files accumulate stale entries (superseded values) and tombstones (deleted keys). compact reclaims disk space by rewriting only the live records from immutable (non-active) data files into fresh files, then deleting the originals.

This directly implements the "compaction" or "merging" process described in DDIA Chapter 3's discussion of hash-indexed log-structured storage.

Contract

Preconditions:

Postconditions:

Invariants preserved:

Parameters

None — operates entirely on self state.

Return Value

None. All effects are side effects.

Algorithm

The method proceeds in six phases:

Phase 1: Identify immutable files


immutable_ids = [fid for fid in all_ids if fid != self.active_file_id]

The active file is excluded because it's still being written to. If there are no immutable files, the method returns immediately.

Phase 2: Scan immutable files for latest records

Iterates through each immutable file in ID order (oldest first), parsing every record by reading the binary header (timestamp, keysize, valuesize) followed by key and value bytes. Builds a latest dict keyed by string key:

Processing files in ascending ID order means later (newer) files naturally overwrite earlier entries, but the explicit timestamp comparison handles out-of-order cases.

Phase 3: Filter against the keydir


if entry and entry.file_id in immutable_ids:
    keys_to_compact[key] = (fid, offset, size, ts)

A key whose keydir entry points to the active file has a newer value there. The immutable copies are stale, so they're excluded from the merge output. This prevents compaction from resurrecting an old value that was overwritten in the active file.

Phase 4: Write merged files

Closes readers for old immutable files, then writes each surviving record to new merged data files:

Phase 5: Delete old immutable files

Both .data and .hint files for every old immutable file ID are removed from disk.

Phase 6: Renumber and reopen the active file

The active file must be renumbered so its ID is above the merged files (maintaining the monotonic ID invariant):

1. Closes the active file handle.

2. Renames the active data file from old ID to mergedfileid + 1.

3. Walks the keydir to update any entries that pointed to the old active file ID.

4. Reopens the active file at its new path.

Side Effects

Error Handling

There is essentially none. The method does not use try/except and makes no attempt at atomicity:

This is appropriate for a reference implementation but would be a serious concern in production.

Usage Patterns

Compaction is triggered explicitly by the caller — there's no automatic background compaction:


store = BitcaskStore("/tmp/data")
for i in range(100000):
    store.put(f"key-{i % 100}", f"value-{i}")
store.compact()  # reclaim space from ~99,900 stale entries

Callers should ensure no concurrent reads or writes during compaction. The method is idempotent in the sense that calling it when there are no immutable files is a no-op.

Dependencies

Unspoken Assumptions

1. Single-writer exclusivity: No other process or thread is reading or writing the data directory during compaction. This is not enforced by any locking mechanism.

2. File IDs never wrap or collide: The scheme of assigning activefileid + 1 as the base for merged files assumes there are no existing files at those IDs.

3. Records fit in memory: Each record is read entirely into memory. There is no streaming for large values.

4. Timestamps are monotonically increasing: The "latest wins" logic assumes timestamps from time.time() don't go backwards. Clock adjustments could violate this.

5. The active file's data file still exists on disk at oldactiveid: The rename at the end assumes no external process has moved or deleted it.