Function: CountingBloomFilter in bloom-filter/bloom_filter.py

Date: 2026-05-29

Time: 07:34

CountingBloomFilter

Purpose

A standard Bloom filter is append-only — once a bit is set, you can never unset it because multiple items may share that bit position. CountingBloomFilter solves this by replacing each single bit with a multi-bit counter, enabling deletion of items while preserving the probabilistic membership guarantee. Each hash position tracks how many items map to it, so decrementing on removal is safe as long as the counter hasn't saturated.

This is the classic extension described in Fan et al. (2000) and referenced in DDIA's discussion of probabilistic data structures.

Contract

Preconditions:

Postconditions:

Invariants:

Parameters

_init_

| Parameter | Type | Default | Meaning |

|-----------|------|---------|---------|

| expected_items | int | 1000 | Anticipated number of distinct items. Used to size the counter array and choose the hash count. |

| falsepositiverate | float | 0.01 | Target FPR. Lower values → more counters (m) and more hashes (k). |

| counterbits | int | 4 | Bits per counter, controlling the saturation ceiling (2^counterbits - 1). With the default of 4, counters saturate at 15. |

Edge cases: If expecteditems is 0 or falsepositiverate is 1.0, the formulas degenerate — m could be 1 and _k could be 0 (clamped to 1). There's no validation guarding against nonsensical inputs like negative values or FPR ≥ 1.

Return Value

Algorithm

Sizing (_init_)

Uses the same optimal Bloom filter formulas as BloomFilter:

1. Counter array size m = ceil(-n * ln(p) / (ln2)²) — minimizes FPR for a given capacity.

2. Hash count k = round((m / n) * ln2) — the optimal number of hash functions.

3. Max counter value 2^counter_bits - 1 (default 15).

4. Allocates a bytearray of length m — one full byte per counter, even though only counter_bits bits are logically used. This trades ~2× space for simpler indexing.

add(item)

1. Compute k hash positions via _hashes.

2. For each position, increment the counter unless it's already at maxval (saturation guard).

3. Increment _count.

remove(item)

Two-pass approach:

1. Validation pass: Compute all k positions. If any counter is 0, raise ValueError — the item was never added (or was already removed).

2. Decrement pass: For each position, decrement the counter only if it hasn't saturated. A saturated counter stays at maxval because the true count is unknown once saturation occurred — decrementing would risk undercount.

3. Decrement _count.

_contains_(item)

Returns True iff all k hash positions have counters > 0. Identical semantics to a standard Bloom filter's membership test, just reading counters instead of bits.

Side Effects

Error Handling

Usage Patterns


cbf = CountingBloomFilter(expected_items=10000, false_positive_rate=0.001)

cbf.add("user:42")
assert "user:42" in cbf

cbf.remove("user:42")
# "user:42" in cbf is now *probably* False, but could still be True

Caller obligations:

Dependencies

Unspoken Assumptions

1. Items are strings. The _hashes function calls .encode("utf-8") — passing bytes, ints, or other types will fail at runtime with no type guard.

2. MD5 is sufficient for uniformity. MD5 is not cryptographically secure, but that's irrelevant here — what matters is uniform distribution, which MD5 provides adequately.

3. One byte per counter is acceptable. Even with counter_bits=4, each counter occupies a full byte. For large m, this is 8× the memory of a packed representation.

4. count can go negative. There's no guard against removing items that were never added when hash collisions mask the absence. The count field is a simple add/remove tally, not a true cardinality estimate.

5. No thread safety. Concurrent add/remove calls can race on counter reads and writes.