Function: garbagecollect in snapshot-isolation/mvccdatabase.py

Date: 2026-05-29

Time: 10:09

MVCCDatabase.garbage_collect

Purpose

This is the MVCC vacuum — it reclaims storage by removing version records that no active transaction can ever read. In an MVCC system, every write creates a new Version object rather than overwriting in place, so the version list for a key grows without bound unless something prunes it. This method is that pruning pass.

It exists because snapshot isolation requires keeping old versions alive as long as any transaction might need them for its consistent snapshot. Once every transaction that could see a version has ended, the version is dead weight.

Contract

Preconditions: None enforced. Can be called at any time, even during active transactions. Does not require (or acquire) any lock — this is safe only because the implementation is single-threaded.

Postconditions:

Invariant preserved: After GC, isvisible(tx, v) still returns the same result for every remaining version v and every active transaction tx. GC never removes a version that an active transaction could read.

Parameters

None. It operates entirely on instance state.

Return Value

Returns int — the total number of Version objects removed across all keys. The caller can use this to decide whether GC was productive (e.g., log it, or skip the next GC cycle if removed == 0).

Algorithm

The method processes each key independently:

Step 1: Purge aborted versions


versions = [v for v in versions if v.created_by not in self._aborted]

Versions from aborted transactions are invisible to everyone (per isvisible), so they're unconditionally dropped. This is always safe.

Step 2a: No active transactions — aggressive compaction

When active_txs is empty, no snapshot references exist, so only the "current truth" matters.

It iterates the version list looking for the last version where:

Because it iterates forward and overwrites latest, it picks the *last* such version in list order — which is the newest, since versions are appended chronologically.

If a latest surviving version is found, everything else is discarded. If *all* versions are committed-deleted, they're all removed and the key is deleted from _versions.

Step 2b: Active transactions — conservative pruning

With active transactions, the method must preserve any version that *any* active transaction might read. The key insight is the mintxid — the oldest active transaction:


min_tx_id = min(t.tx_id for t in active_txs)

If a version was committed *and* superseded (deleted or replaced) before mintxid, then *every* active transaction sees the superseding version, not this one. It's safe to remove.

Two removal conditions:

1. Explicitly deleted: deletedby is committed and < mintx_id. The deletion is visible to all active transactions, so the deleted version is invisible to all of them.

2. Zombie (superseded): The version is committed, not explicitly deleted, but a *newer* committed version exists (also < mintxid). The newer version shadows this one for every active transaction. The any(...) check scans committedbeforeoldest to find such a superseding version.

Everything else goes into to_keep.

Step 3: Update storage

If versions remain, the list is replaced. If empty, the key is deleted from _versions entirely to avoid accumulating empty lists.

Side Effects

Mutates self.versions — removes entries and replaces version lists. Does not touch committed, aborted, transactions, or any Transaction object. This is a pure storage compaction pass.

Error Handling

None. No exceptions are raised or caught. The method assumes committed, aborted, and versions are in a consistent state. If they're not (e.g., a txid appears in both committed and aborted), the behavior is silently wrong — no invariant checks.

Usage Patterns

Typically called between transaction batches or on a periodic schedule:


db = MVCCDatabase()
# ... run some transactions, commit/abort them ...
removed = db.garbage_collect()

Caller obligations: The caller should ensure no concurrent modifications to _versions while GC runs. In this single-threaded implementation that's trivially satisfied, but it would be a problem in a concurrent setting.

GC is idempotent — calling it twice with no intervening state changes removes 0 versions the second time.

Dependencies

Relies entirely on instance state:

No external modules. No I/O.

Notable Assumptions

1. Versions are appended in tx_id order. The no-active-transactions path picks latest by iterating forward, so the last match wins. If versions were out of order, this would pick the wrong "latest."

2. committedbeforeoldest is computed but partially unused. The variable is built and used in the zombie check's any(...), but it's not used for the explicit-deletion check. This is correct but means the variable serves a narrower purpose than its name suggests.

3. deletedby semantics are overloaded. A version with deletedby set could mean either an explicit delete() call or a soft-delete from a conflicting write. GC treats both the same, which is correct.

4. Single-threaded execution. There is no locking. If garbagecollect ran concurrently with read or write, you'd get races on the versions dict and its lists.