Date: 2026-05-29
Time: 10:14
BloomFilter — Space-Efficient Probabilistic Set MembershipBloomFilter is a classic probabilistic data structure that answers "is this item in the set?" with two possible outcomes: definitely not or probably yes. It trades a controlled false positive rate for dramatic space savings over exact set representations like hash sets. This is the foundational building block in the file — CountingBloomFilter and ScalableBloomFilter both build on top of it.
In DDIA's context, Bloom filters appear in LSM-tree storage engines (like LevelDB/RocksDB) to avoid unnecessary disk reads: before checking an SSTable on disk, the engine queries its Bloom filter to skip SSTables that definitely don't contain the key.
Preconditions:
add and _contains must be strings (the hashes function calls .encode("utf-8"))bitsize and numhashes (manual configuration) or neither (auto-sizing from expecteditems and falsepositive_rate)Postconditions:
add(x), x in filter is always True (no false negatives)x in filter may return True for items never added (false positive), with probability bounded by the configured rate *when item count stays at or below expected_items*Invariants:
self._bits is always ceil(m / 8) bytes longself._count tracks the number of add() calls, not the number of distinct items| Parameter | Type | Default | Meaning |
|-----------|------|---------|---------|
| expected_items | int | 1000 | Anticipated number of distinct items. Used to size the bit array. |
| falsepositiverate | float | 0.01 | Target FPR (1%). Smaller values require more bits. |
| bitsize | int or None | None | Override: set m directly. Must be paired with numhashes. |
| numhashes | int or None | None | Override: set k directly. Must be paired with bitsize. |
Edge cases: If only one of bitsize/numhashes is provided, the other is silently ignored and the auto-sizing path runs — this is likely unintentional and not enforced.
add() returns None; mutates the filter in place._contains_() returns bool — False is definitive, True is probabilistic.estimate_count() returns float — can be 0.0, inf, or an estimate using the inverse of the bit-filling formula.union() returns a new BloomFilter instance; does not mutate either operand.to_bytes() returns bytes — a 12-byte header (m, k, count as little-endian uint32s) followed by the raw bit array.Auto-sizing (_init_):
The optimal bit array size m and hash count k for a target false positive rate p with n expected items are:
m = ceil(-n * ln(p) / (ln2)²)
k = round((m / n) * ln2)
These come from minimizing the FPR formula (1 - e^(-kn/m))^k. The implementation computes these directly, clamping both to a minimum of 1.
Hashing (_hashes):
Uses Kirschner-Mitzenmacker double hashing: a single MD5 digest is split into two 64-bit values h1 and h2, then k positions are generated as (h1 + i * h2) % m. This is a well-known technique that provides pairwise-independent hash functions from a single hash computation.
Membership test (_contains_):
Checks all k bit positions. Short-circuits on the first unset bit (definite negative). Returns True only if all k positions are set.
Union:
Requires identical m and k. Performs bitwise OR of the underlying byte arrays. The resulting _count is the sum of both counts, which is an overcount if items overlap — this is a known limitation since the filter can't determine overlap.
Cardinality estimation (estimate_count):
Uses the formula -(m/k) * ln(1 - X/m) where X is the number of set bits. This inverts the expected bit density formula to estimate how many distinct items were inserted. Returns inf when all bits are set (the log term would be undefined).
add() mutates self.bits and increments self.countunion() raises ValueError if the two filters have different m or kfrom_bytes() will raise struct.error if data is shorter than 12 bytes_hashes with AttributeError on .encode()count (Python ints don't overflow, but the struct.pack("<I", ...) in tobytes will fail if count exceeds 2³² - 1)
# Typical: auto-sized for expected workload
bf = BloomFilter(expected_items=10_000, false_positive_rate=0.001)
bf.add("key-123")
if "key-456" in bf:
# probably present — do the expensive lookup
pass
# Merging filters from different SSTable levels
merged = bf1.union(bf2)
# Serialization for persistence alongside SSTable files
data = bf.to_bytes()
restored = BloomFilter.from_bytes(data)
Callers must understand that len(bf) returns the number of add() calls, not distinct items. For distinct-item estimation, use estimate_count().
hashlib.md5 — hash function (via _hashes)math — log, ceil for sizing formulasstruct — binary serialization (little-endian uint32 packing)No external dependencies. The _hashes module-level function is tightly coupled — it's the only hash strategy and isn't pluggable.
1. Items are assumed to be strings — no type check before calling _hashes
2. Providing only one of bitsize/numhashes silently falls through to auto-sizing
3. union() sums _count which overcounts when filters share items
4. estimatedfalsepositive_rate() uses density^k which is an approximation (it assumes bit positions are independent, which becomes less true as the filter fills)
5. MD5 is used for hashing — fine for data structure purposes but not cryptographic; the choice prioritizes speed and uniform distribution
bloom-filter/bloomfilter.py:hashes — The double-hashing scheme that maps items to k bit positions using a single MD5 digestbloom-filter/bloom_filter.py:ScalableBloomFilter — How the filter grows automatically by chaining slices with geometrically tightening FPR targetsbloom-filter/bloom_filter.py:CountingBloomFilter — How saturating counters enable deletion at the cost of extra spacebloom-filter/testbloomfilter.py — Test cases that verify FPR bounds, union semantics, and serialization round-tripsbloom-filter-in-lsm-trees — How storage engines like LevelDB/RocksDB attach Bloom filters to SSTables to skip unnecessary disk reads during point lookupsbloom-filter-no-false-negatives — After add(x), _contains_(x) always returns True; the filter never produces false negativesbloom-filter-double-hashing — Hash positions are generated via Kirschner-Mitzenmacker double hashing (h1 + i*h2) % m from a single MD5 digest, not k independent hash functionsbloom-filter-monotonic-bits — Bits in the filter are only ever set, never cleared; this makes the structure append-only and union-compatiblebloom-filter-count-is-insertions — count tracks the number of add() calls, not distinct items; estimatecount() uses bit density for cardinality estimation insteadbloom-filter-union-requires-identical-params — union() requires both filters to have the same m and k; it raises ValueError otherwise, since bitwise OR is only meaningful when bit positions map to the same hash space