Function: BloomFilter in bloom-filter/bloom_filter.py

Date: 2026-05-29

Time: 10:14

BloomFilter — Space-Efficient Probabilistic Set Membership

Purpose

BloomFilter is a classic probabilistic data structure that answers "is this item in the set?" with two possible outcomes: definitely not or probably yes. It trades a controlled false positive rate for dramatic space savings over exact set representations like hash sets. This is the foundational building block in the file — CountingBloomFilter and ScalableBloomFilter both build on top of it.

In DDIA's context, Bloom filters appear in LSM-tree storage engines (like LevelDB/RocksDB) to avoid unnecessary disk reads: before checking an SSTable on disk, the engine queries its Bloom filter to skip SSTables that definitely don't contain the key.

Contract

Preconditions:

Postconditions:

Invariants:

Parameters

| Parameter | Type | Default | Meaning |

|-----------|------|---------|---------|

| expected_items | int | 1000 | Anticipated number of distinct items. Used to size the bit array. |

| falsepositiverate | float | 0.01 | Target FPR (1%). Smaller values require more bits. |

| bitsize | int or None | None | Override: set m directly. Must be paired with numhashes. |

| numhashes | int or None | None | Override: set k directly. Must be paired with bitsize. |

Edge cases: If only one of bitsize/numhashes is provided, the other is silently ignored and the auto-sizing path runs — this is likely unintentional and not enforced.

Return Values

Algorithm

Auto-sizing (_init_):

The optimal bit array size m and hash count k for a target false positive rate p with n expected items are:


m = ceil(-n * ln(p) / (ln2)²)
k = round((m / n) * ln2)

These come from minimizing the FPR formula (1 - e^(-kn/m))^k. The implementation computes these directly, clamping both to a minimum of 1.

Hashing (_hashes):

Uses Kirschner-Mitzenmacker double hashing: a single MD5 digest is split into two 64-bit values h1 and h2, then k positions are generated as (h1 + i * h2) % m. This is a well-known technique that provides pairwise-independent hash functions from a single hash computation.

Membership test (_contains_):

Checks all k bit positions. Short-circuits on the first unset bit (definite negative). Returns True only if all k positions are set.

Union:

Requires identical m and k. Performs bitwise OR of the underlying byte arrays. The resulting _count is the sum of both counts, which is an overcount if items overlap — this is a known limitation since the filter can't determine overlap.

Cardinality estimation (estimate_count):

Uses the formula -(m/k) * ln(1 - X/m) where X is the number of set bits. This inverts the expected bit density formula to estimate how many distinct items were inserted. Returns inf when all bits are set (the log term would be undefined).

Side Effects

Error Handling

Usage Patterns


# Typical: auto-sized for expected workload
bf = BloomFilter(expected_items=10_000, false_positive_rate=0.001)
bf.add("key-123")
if "key-456" in bf:
    # probably present — do the expensive lookup
    pass

# Merging filters from different SSTable levels
merged = bf1.union(bf2)

# Serialization for persistence alongside SSTable files
data = bf.to_bytes()
restored = BloomFilter.from_bytes(data)

Callers must understand that len(bf) returns the number of add() calls, not distinct items. For distinct-item estimation, use estimate_count().

Dependencies

No external dependencies. The _hashes module-level function is tightly coupled — it's the only hash strategy and isn't pluggable.

Assumptions Not Enforced

1. Items are assumed to be strings — no type check before calling _hashes

2. Providing only one of bitsize/numhashes silently falls through to auto-sizing

3. union() sums _count which overcounts when filters share items

4. estimatedfalsepositive_rate() uses density^k which is an approximation (it assumes bit positions are independent, which becomes less true as the filter fills)

5. MD5 is used for hashing — fine for data structure purposes but not cryptographic; the choice prioritizes speed and uniform distribution

Topics to Explore

Beliefs