Date: 2026-05-28
Time: 18:22
BitcaskStore.compactcompact is the garbage collection mechanism for a Bitcask append-only storage engine. Because Bitcask never overwrites data in place — every put and delete appends a new record — old data files accumulate stale entries (superseded values) and tombstones (deleted keys). compact reclaims disk space by rewriting only the live records from immutable (non-active) data files into fresh files, then deleting the originals.
This directly implements the "compaction" or "merging" process described in DDIA Chapter 3's discussion of hash-indexed log-structured storage.
Preconditions:
keydir, filehandles, activefile_id, and the filesystem simultaneously.Postconditions:
keydir is updated so all entries point to either the new merged files or the (renumbered) active file.Invariants preserved:
activefileid + 1, and the active file is renumbered to mergedfileid + 1.keydir remains a consistent index: every key maps to the correct (file_id, offset, size) for its most recent live value.None — operates entirely on self state.
None. All effects are side effects.
The method proceeds in six phases:
immutable_ids = [fid for fid in all_ids if fid != self.active_file_id]
The active file is excluded because it's still being written to. If there are no immutable files, the method returns immediately.
Iterates through each immutable file in ID order (oldest first), parsing every record by reading the binary header (timestamp, keysize, valuesize) followed by key and value bytes. Builds a latest dict keyed by string key:
val_size == 0): Removes the key from latest — the delete should propagate.Processing files in ascending ID order means later (newer) files naturally overwrite earlier entries, but the explicit timestamp comparison handles out-of-order cases.
if entry and entry.file_id in immutable_ids:
keys_to_compact[key] = (fid, offset, size, ts)
A key whose keydir entry points to the active file has a newer value there. The immutable copies are stale, so they're excluded from the merge output. This prevents compaction from resurrecting an old value that was overwritten in the active file.
Closes readers for old immutable files, then writes each surviving record to new merged data files:
activefileid + 1 to avoid collisions.maxfilesize, just like active files.keydir is updated in-place to point to the new locations.Both .data and .hint files for every old immutable file ID are removed from disk.
The active file must be renumbered so its ID is above the merged files (maintaining the monotonic ID invariant):
1. Closes the active file handle.
2. Renames the active data file from old ID to mergedfileid + 1.
3. Walks the keydir to update any entries that pointed to the old active file ID.
4. Reopens the active file at its new path.
.data and .hint files; deletes old .data and .hint files; renames the active data file.self.keydir (updates fileid/offset/size for compacted keys and active-file keys), self.filehandles (closes old, opens new), self.activefileid, and self.active_file.There is essentially none. The method does not use try/except and makes no attempt at atomicity:
keydir entries pointing to deleted files.This is appropriate for a reference implementation but would be a serious concern in production.
Compaction is triggered explicitly by the caller — there's no automatic background compaction:
store = BitcaskStore("/tmp/data")
for i in range(100000):
store.put(f"key-{i % 100}", f"value-{i}")
store.compact() # reclaim space from ~99,900 stale entries
Callers should ensure no concurrent reads or writes during compaction. The method is idempotent in the sense that calling it when there are no immutable files is a no-op.
os — file operations (path, size, rename, remove, makedirs)struct — binary record parsing/packing using HEADERFORMAT and HINTFORMATKeyEntry — dataclass for keydir entriesfindfileids, datapath, hintpath, getreader, writehintfile, openactive_file1. Single-writer exclusivity: No other process or thread is reading or writing the data directory during compaction. This is not enforced by any locking mechanism.
2. File IDs never wrap or collide: The scheme of assigning activefileid + 1 as the base for merged files assumes there are no existing files at those IDs.
3. Records fit in memory: Each record is read entirely into memory. There is no streaming for large values.
4. Timestamps are monotonically increasing: The "latest wins" logic assumes timestamps from time.time() don't go backwards. Clock adjustments could violate this.
5. The active file's data file still exists on disk at oldactiveid: The rename at the end assumes no external process has moved or deleted it.