File: bloom-filter/testbloomfilter.py

Date: 2026-05-29

Time: 10:15

I'll work from the test file and repository structure since the implementation file isn't accessible.

bloom-filter/testbloomfilter.py

Purpose

This is the test suite for a Bloom filter module that implements three variants: BloomFilter, CountingBloomFilter, and ScalableBloomFilter. It validates the probabilistic data structure's core guarantees — no false negatives, bounded false positives — along with serialization, set operations, and cardinality estimation. Within the DDIA reference implementations repo, it serves as the executable specification for Chapter 3's discussion of Bloom filters as used in LSM-tree storage engines (e.g., checking whether an SSTable might contain a key before doing disk I/O).

Key Components

Test groups by concern:

| Test | What it validates |

|------|-------------------|

| testaddandcontains | Basic insert + membership, len_ tracking |

| testnofalse_negatives | The fundamental Bloom filter invariant at scale (1000 items) |

| testfalsepositive_rate | Empirical FPR stays below 2× the configured target (50k probes) |

| testoptimalparameters | bitcount and hashcount match the textbook formulas: m = ⌈-n·ln(p) / ln²(2)⌉, k = round((m/n)·ln(2)) |

| testexplicitparameters | Alternate constructor path: raw bitsize + numhashes instead of probabilistic sizing |

| testcountingbloom_filter | CountingBloomFilter supports add, remove, membership, and length |

| testcountingremove_nonexistent | Removing an absent element raises ValueError |

| testcountingdoubleaddremove | Reference-counting semantics: two adds require two removes |

| testserialization | tobytes() / from_bytes() round-trip preserves bit array, parameters, and membership |

| testunion | Bitwise OR merge of two compatible filters; testunion_incompatible rejects mismatched sizes |

| testestimatecount | Cardinality estimation within 10% of actual, using the bit-density formula |

| testemptyfilter | Zero-item edge case: density 0, estimated FPR 0, count 0 |

| testduplicateadd | Duplicates increment len() — the filter counts insertions, not distinct items |

| testdeterminism | Same inputs produce identical bits arrays — the hash function is deterministic, not randomized |

| testscalablebloom_filter | ScalableBloomFilter auto-grows by adding slices when capacity is exceeded |

| testbitarraydensity / testestimated_fpr | Statistics accessors return sane values |

Patterns

1. Two constructor paths. BloomFilter accepts either (expecteditems, falsepositiverate) for auto-sizing or (bitsize, num_hashes) for explicit control. Tests cover both.

2. Empirical probabilistic testing. testfalsepositive_rate doesn't just check the formula — it runs 50,000 non-member probes and measures the actual FPR. The 2× tolerance accounts for statistical variance while catching broken hash functions.

3. Pythonic container protocol. The implementation supports in (via _contains), len() (via len), and direct attribute access (bits), making Bloom filters feel like native Python collections.

4. Variant taxonomy. Three classes form a progression:

Dependencies

Imports:

Imported by:

Flow

Tests follow a consistent pattern:

1. Construct a filter with known parameters

2. Insert a set of items (usually string-formatted: f"prefix-{i}")

3. Assert membership, non-membership, or statistical properties

4. For counting/scalable variants: exercise the additional API (remove, auto-growth)

The false-positive-rate test is the most involved: it inserts 5,000 members, then probes 50,000 distinct non-members, counts hits, and asserts the ratio is below the threshold.

Invariants

1. No false negatives — every inserted item must be found. testnofalsenegatives checks this at scale; testaddandcontains checks it at small scale.

2. FPR ≤ 2× target — the empirical false positive rate must not exceed double the configured rate.

3. Optimal sizing follows textbook formulasm and k are derived from n and p using the standard formulas, not approximations.

4. Counting filter is reference-countedadd("x") twice then remove("x") once leaves "x" present.

5. Union requires compatible parameters — filters with different bit_size cannot be unioned.

6. Cardinality estimation within 10%estimate_count() uses the bit-density-based estimator and must be reasonably accurate.

7. Deterministic hashing — identical inputs to identically-configured filters produce identical bit arrays.

Error Handling

Topics to Explore

Beliefs