File: bloom-filter/bloom_filter.py

Date: 2026-05-29

Time: 13:26

bloom-filter/bloom_filter.py

Purpose

This file implements three variants of the Bloom filter data structure, a core concept from Chapter 3 of *Designing Data-Intensive Applications*. Bloom filters are used in storage engines (LSM-trees, SSTables) to avoid expensive disk reads for keys that don't exist. The file provides a standard Bloom filter, a counting variant that supports deletion, and a scalable variant that grows automatically — covering the progression from textbook to production-grade.

Key Components

_hashes(item, k, m) — Hash Position Generator

Uses double hashing with MD5 to produce k positions in [0, m). The MD5 digest is split into two 8-byte halves interpreted as little-endian integers h1 and h2, then positions are computed as (h1 + i * h2) % m. This is the Kirschner-Mitzenmacher optimization — two hash functions simulate k independent ones. All three filter classes share this single function.

BloomFilter — Standard Probabilistic Set

The constructor accepts either semantic parameters (expecteditems, falsepositiverate) or explicit parameters (bitsize, num_hashes). When using semantic parameters, it derives optimal sizing:

The bit array is stored as a bytearray of ceil(m/8) bytes, with manual bit manipulation (pos // 8 for byte index, pos % 8 for bit offset). This is more memory-efficient than a list of booleans.

Notable methods beyond add/contains:

CountingBloomFilter — Deletion-Supporting Variant

Replaces the bit array with a bytearray of per-position counters (one byte each, but logically bounded to counter_bits — default 4 bits, max value 15). This enables remove() by decrementing counters.

Key design choice: counter saturation. When a counter hits maxval, add() stops incrementing it, and remove() won't decrement it either. This prevents counter overflow from corrupting the filter at the cost of potential false positives from permanently-set positions.

The remove() method does a two-pass check: first verifies all positions are non-zero (raising ValueError if not), then decrements. This avoids partial decrements on items that weren't added.

ScalableBloomFilter — Auto-Growing Filter

Maintains a list of BloomFilter slices. Each new slice gets geometrically larger capacity (initial_cap * growth^idx) and a tighter FPR target (p * ratio^idx). The tightening FPR ensures the aggregate false positive rate across all slices stays bounded — the geometric series p * (r^0 + r^1 + ...) converges to p / (1 - r).

Membership checks scan all slices (any()), so lookup cost grows linearly with the number of slices. This is the tradeoff for unbounded growth without rehashing.

Patterns

Dependencies

Imports: Only stdlib — hashlib (MD5 for hashing), math (log/ceil for sizing), struct (binary serialization). No external dependencies.

Imported by: testbloomfilter.py and testertestbloom_filter.py — the test suite and its validation harness.

Flow

Add path: item (string) → hashes() encodes to UTF-8, hashes with MD5, splits digest into h1/h2, generates k positions → each position sets a bit via |=count increments.

Lookup path: Same hash computation → each position checked via & → short-circuits to False on any zero bit, returns True only if all k bits are set.

Serialization round-trip: tobytes() packs header + bit array → frombytes() unpacks header, constructs filter with explicit sizing, overwrites bits and count.

Scalable growth: add() checks len(current) >= cap → if full, addslice() creates a new BloomFilter with doubled capacity and halved FPR target → item goes into the new slice.

Invariants

1. m and k are always ≥ 1 — enforced by max(1, ...) in the constructor.

2. Bit array size matches mbytearray((m + 7) // 8) allocates exactly enough bytes.

3. Union requires identical m and kValueError if mismatched, since bitwise OR is only meaningful for identically-parameterized filters.

4. Counters saturate, never overflowCountingBloomFilter.add() checks < maxval before incrementing.

5. Remove is atomic or failsCountingBloomFilter.remove() checks all positions before decrementing any.

6. Scalable FPR tightens geometrically — each slice's target FPR is p * ratio^idx, ensuring the union of all slices doesn't exceed the overall target.

Error Handling

Minimal and deliberate:

Topics to Explore

Beliefs