Date: 2026-05-29
Time: 07:34
CountingBloomFilterA standard Bloom filter is append-only — once a bit is set, you can never unset it because multiple items may share that bit position. CountingBloomFilter solves this by replacing each single bit with a multi-bit counter, enabling deletion of items while preserving the probabilistic membership guarantee. Each hash position tracks how many items map to it, so decrementing on removal is safe as long as the counter hasn't saturated.
This is the classic extension described in Fan et al. (2000) and referenced in DDIA's discussion of probabilistic data structures.
Preconditions:
add, remove, and _contains must be strings (they're passed to hashes, which calls .encode("utf-8")).remove requires the item to have been previously added — specifically, all hash positions must have non-zero counters.Postconditions:
add(x), x in filter is guaranteed True (no false negatives).remove(x), x in filter may still be True if other items share all of x's hash positions (false positive)._len_ tracks the number of add calls minus remove calls, not distinct items.Invariants:
[0, maxval]. Counters that reach maxval are saturated and never increment or decrement further.m and k are always ≥ 1._init_| Parameter | Type | Default | Meaning |
|-----------|------|---------|---------|
| expected_items | int | 1000 | Anticipated number of distinct items. Used to size the counter array and choose the hash count. |
| falsepositiverate | float | 0.01 | Target FPR. Lower values → more counters (m) and more hashes (k). |
| counterbits | int | 4 | Bits per counter, controlling the saturation ceiling (2^counterbits - 1). With the default of 4, counters saturate at 15. |
Edge cases: If expecteditems is 0 or falsepositiverate is 1.0, the formulas degenerate — m could be 1 and _k could be 0 (clamped to 1). There's no validation guarding against nonsensical inputs like negative values or FPR ≥ 1.
_contains_ returns bool — True if all counter positions are non-zero (may be a false positive)._len_ returns int — the net add/remove count, which can go negative if remove is called more times than add for a given item (the code doesn't prevent this once the first-pass zero-check passes).add and remove return None._init_)Uses the same optimal Bloom filter formulas as BloomFilter:
1. Counter array size m = ceil(-n * ln(p) / (ln2)²) — minimizes FPR for a given capacity.
2. Hash count k = round((m / n) * ln2) — the optimal number of hash functions.
3. Max counter value 2^counter_bits - 1 (default 15).
4. Allocates a bytearray of length m — one full byte per counter, even though only counter_bits bits are logically used. This trades ~2× space for simpler indexing.
add(item)1. Compute k hash positions via _hashes.
2. For each position, increment the counter unless it's already at maxval (saturation guard).
3. Increment _count.
remove(item)Two-pass approach:
1. Validation pass: Compute all k positions. If any counter is 0, raise ValueError — the item was never added (or was already removed).
2. Decrement pass: For each position, decrement the counter only if it hasn't saturated. A saturated counter stays at maxval because the true count is unknown once saturation occurred — decrementing would risk undercount.
3. Decrement _count.
_contains_(item)Returns True iff all k hash positions have counters > 0. Identical semantics to a standard Bloom filter's membership test, just reading counters instead of bits.
add and remove mutate self.counters and self.count in place.remove raises ValueError if any hash position has a zero counter. This is the only explicit error path.remove is called for an item that wasn't added but whose hash positions overlap with other items' positions (all counters > 0), the removal silently succeeds and corrupts the filter. This is inherent to the data structure — membership is probabilistic, so removal correctness is also probabilistic.maxval. A saturated counter effectively becomes "sticky," preventing accurate removal for any item that hashes to that position.
cbf = CountingBloomFilter(expected_items=10000, false_positive_rate=0.001)
cbf.add("user:42")
assert "user:42" in cbf
cbf.remove("user:42")
# "user:42" in cbf is now *probably* False, but could still be True
Caller obligations:
counterbits to avoid saturation. With counterbits=4, any position hit 15+ times is permanently stuck._hashes(item, k, m) (module-level function): Double-hashing scheme using MD5. Produces k positions in [0, m) from a single MD5 digest via h1 + i*h2 mod m. This is the Kirby-Steele technique — two hash functions simulate k independent ones.hashlib: MD5 digest in _hashes.math: log and ceil for parameter sizing.1. Items are strings. The _hashes function calls .encode("utf-8") — passing bytes, ints, or other types will fail at runtime with no type guard.
2. MD5 is sufficient for uniformity. MD5 is not cryptographically secure, but that's irrelevant here — what matters is uniform distribution, which MD5 provides adequately.
3. One byte per counter is acceptable. Even with counter_bits=4, each counter occupies a full byte. For large m, this is 8× the memory of a packed representation.
4. count can go negative. There's no guard against removing items that were never added when hash collisions mask the absence. The count field is a simple add/remove tally, not a true cardinality estimate.
5. No thread safety. Concurrent add/remove calls can race on counter reads and writes.