Date: 2026-05-29
Time: 10:15
I'll work from the test file and repository structure since the implementation file isn't accessible.
bloom-filter/testbloomfilter.pyThis is the test suite for a Bloom filter module that implements three variants: BloomFilter, CountingBloomFilter, and ScalableBloomFilter. It validates the probabilistic data structure's core guarantees — no false negatives, bounded false positives — along with serialization, set operations, and cardinality estimation. Within the DDIA reference implementations repo, it serves as the executable specification for Chapter 3's discussion of Bloom filters as used in LSM-tree storage engines (e.g., checking whether an SSTable might contain a key before doing disk I/O).
Test groups by concern:
| Test | What it validates |
|------|-------------------|
| testaddandcontains | Basic insert + membership, len_ tracking |
| testnofalse_negatives | The fundamental Bloom filter invariant at scale (1000 items) |
| testfalsepositive_rate | Empirical FPR stays below 2× the configured target (50k probes) |
| testoptimalparameters | bitcount and hashcount match the textbook formulas: m = ⌈-n·ln(p) / ln²(2)⌉, k = round((m/n)·ln(2)) |
| testexplicitparameters | Alternate constructor path: raw bitsize + numhashes instead of probabilistic sizing |
| testcountingbloom_filter | CountingBloomFilter supports add, remove, membership, and length |
| testcountingremove_nonexistent | Removing an absent element raises ValueError |
| testcountingdoubleaddremove | Reference-counting semantics: two adds require two removes |
| testserialization | tobytes() / from_bytes() round-trip preserves bit array, parameters, and membership |
| testunion | Bitwise OR merge of two compatible filters; testunion_incompatible rejects mismatched sizes |
| testestimatecount | Cardinality estimation within 10% of actual, using the bit-density formula |
| testemptyfilter | Zero-item edge case: density 0, estimated FPR 0, count 0 |
| testduplicateadd | Duplicates increment len() — the filter counts insertions, not distinct items |
| testdeterminism | Same inputs produce identical bits arrays — the hash function is deterministic, not randomized |
| testscalablebloom_filter | ScalableBloomFilter auto-grows by adding slices when capacity is exceeded |
| testbitarraydensity / testestimated_fpr | Statistics accessors return sane values |
1. Two constructor paths. BloomFilter accepts either (expecteditems, falsepositiverate) for auto-sizing or (bitsize, num_hashes) for explicit control. Tests cover both.
2. Empirical probabilistic testing. testfalsepositive_rate doesn't just check the formula — it runs 50,000 non-member probes and measures the actual FPR. The 2× tolerance accounts for statistical variance while catching broken hash functions.
3. Pythonic container protocol. The implementation supports in (via _contains), len() (via len), and direct attribute access (bits), making Bloom filters feel like native Python collections.
4. Variant taxonomy. Three classes form a progression:
BloomFilter — fixed-size, no deletionCountingBloomFilter — adds deletion via per-slot countersScalableBloomFilter — auto-grows by chaining sub-filters (_slices)Imports:
math — for verifying the optimal-parameter formulas (ln, ceil)random, string — imported but unused in the current test suite (likely leftover from an earlier version or intended for fuzz-style tests)pytest — test framework, used only for pytest.raisesbloom_filter — the module under test (BloomFilter, CountingBloomFilter, ScalableBloomFilter)Imported by:
pytest. The sibling testertestbloom_filter.py likely wraps or validates these tests as part of the project's meta-testing pattern (seen across all modules in this repo).Tests follow a consistent pattern:
1. Construct a filter with known parameters
2. Insert a set of items (usually string-formatted: f"prefix-{i}")
3. Assert membership, non-membership, or statistical properties
4. For counting/scalable variants: exercise the additional API (remove, auto-growth)
The false-positive-rate test is the most involved: it inserts 5,000 members, then probes 50,000 distinct non-members, counts hits, and asserts the ratio is below the threshold.
1. No false negatives — every inserted item must be found. testnofalsenegatives checks this at scale; testaddandcontains checks it at small scale.
2. FPR ≤ 2× target — the empirical false positive rate must not exceed double the configured rate.
3. Optimal sizing follows textbook formulas — m and k are derived from n and p using the standard formulas, not approximations.
4. Counting filter is reference-counted — add("x") twice then remove("x") once leaves "x" present.
5. Union requires compatible parameters — filters with different bit_size cannot be unioned.
6. Cardinality estimation within 10% — estimate_count() uses the bit-density-based estimator and must be reasonably accurate.
7. Deterministic hashing — identical inputs to identically-configured filters produce identical bit arrays.
ValueError on invalid removal: CountingBloomFilter.remove() raises ValueError when removing an element that was never added (tested by testcountingremove_nonexistent).ValueError on incompatible union: union() raises ValueError when bit sizes or hash counts differ (tested by testunionincompatible).bloom-filter/bloomfilter.py — The implementation: how hashes generates k hash positions, how the counting variant tracks per-slot counts, and how ScalableBloomFilter chains sub-filters with tightening FPR targetsbloom-filter/bloomfilter.py:BloomFilter.estimatecount — The cardinality estimator likely uses the formula n* = -(m/k)·ln(1 - X/m) where X is set bits; worth understanding its accuracy boundsbloom-filter/testertestbloomfilter.py — The meta-test wrapper; this repo has a testertest_*.py pattern across all modules that likely validates test quality or coveragelog-structured-merge-tree/lsm.py — LSM trees are the primary consumer of Bloom filters in DDIA; see how the filter integrates with SSTable lookups to avoid unnecessary disk readsscalable-bloom-filter-slice-growth — How ScalableBloomFilter tightens the FPR for each new slice (typically geometric: p, p·r, p·r², ...) to keep the aggregate FPR boundedbloom-filter-no-false-negatives — BloomFilter._contains_ never returns False for an item that was previously add()-ed; the test suite verifies this for up to 1000 itemsbloom-filter-len-counts-insertions — BloomFilter._len_ counts total add() calls, not distinct items — adding "dup" twice yields len() == 2counting-bloom-filter-reference-counted — CountingBloomFilter uses reference-counting semantics: an item added N times requires N remove() calls before it tests as absentbloom-filter-optimal-sizing-textbook — BloomFilter computes bitcount as ⌈-n·ln(p)/ln²(2)⌉ and hashcount as round((m/n)·ln(2)), matching the standard formulas from Bloom (1970)bloom-filter-deterministic-hashing — The hash function is deterministic (not seeded with randomness), so two filters with identical parameters produce identical bit arrays for identical inputs