Original source: bloom-filter-test_bloom_filter

File: bloom-filter/testbloomfilter.py

Date: 2026-05-29

Time: 10:15

I'll work from the test file and repository structure since the implementation file isn't accessible.

`bloom-filter/testbloomfilter.py`

Purpose

This is the test suite for a Bloom filter module that implements three variants: BloomFilter, CountingBloomFilter, and ScalableBloomFilter. It validates the probabilistic data structure's core guarantees — no false negatives, bounded false positives — along with serialization, set operations, and cardinality estimation. Within the DDIA reference implementations repo, it serves as the executable specification for Chapter 3's discussion of Bloom filters as used in LSM-tree storage engines (e.g., checking whether an SSTable might contain a key before doing disk I/O).

Key Components

Test groups by concern:

| Test | What it validates |

|------|-------------------|

| testaddandcontains | Basic insert + membership, len_ tracking |

| testnofalse_negatives | The fundamental Bloom filter invariant at scale (1000 items) |

| testfalsepositive_rate | Empirical FPR stays below 2× the configured target (50k probes) |

| testoptimalparameters | bitcount and hashcount match the textbook formulas: m = ⌈-n·ln(p) / ln²(2)⌉, k = round((m/n)·ln(2)) |

| testexplicitparameters | Alternate constructor path: raw bitsize + numhashes instead of probabilistic sizing |

| testcountingbloom_filter | CountingBloomFilter supports add, remove, membership, and length |

| testcountingremove_nonexistent | Removing an absent element raises ValueError |

| testcountingdoubleaddremove | Reference-counting semantics: two adds require two removes |

| testserialization | tobytes() / from_bytes() round-trip preserves bit array, parameters, and membership |

| testunion | Bitwise OR merge of two compatible filters; testunion_incompatible rejects mismatched sizes |

| testestimatecount | Cardinality estimation within 10% of actual, using the bit-density formula |

| testemptyfilter | Zero-item edge case: density 0, estimated FPR 0, count 0 |

| testduplicateadd | Duplicates increment len() — the filter counts insertions, not distinct items |

| testdeterminism | Same inputs produce identical bits arrays — the hash function is deterministic, not randomized |

| testscalablebloom_filter | ScalableBloomFilter auto-grows by adding slices when capacity is exceeded |

| testbitarraydensity / testestimated_fpr | Statistics accessors return sane values |

Patterns

1. Two constructor paths. BloomFilter accepts either (expecteditems, falsepositiverate) for auto-sizing or (bitsize, num_hashes) for explicit control. Tests cover both.

2. Empirical probabilistic testing. testfalsepositive_rate doesn't just check the formula — it runs 50,000 non-member probes and measures the actual FPR. The 2× tolerance accounts for statistical variance while catching broken hash functions.

3. Pythonic container protocol. The implementation supports in (via _contains), len() (via len), and direct attribute access (bits), making Bloom filters feel like native Python collections.

4. Variant taxonomy. Three classes form a progression:

BloomFilter — fixed-size, no deletion
CountingBloomFilter — adds deletion via per-slot counters
ScalableBloomFilter — auto-grows by chaining sub-filters (_slices)

Dependencies

Imports:

math — for verifying the optimal-parameter formulas (ln, ceil)
random, string — imported but unused in the current test suite (likely leftover from an earlier version or intended for fuzz-style tests)
pytest — test framework, used only for pytest.raises
bloom_filter — the module under test (BloomFilter, CountingBloomFilter, ScalableBloomFilter)

Imported by:

Nothing imports this file directly. It's executed by pytest. The sibling testertestbloom_filter.py likely wraps or validates these tests as part of the project's meta-testing pattern (seen across all modules in this repo).

Flow

Tests follow a consistent pattern:

1. Construct a filter with known parameters

2. Insert a set of items (usually string-formatted: f"prefix-{i}")

3. Assert membership, non-membership, or statistical properties

4. For counting/scalable variants: exercise the additional API (remove, auto-growth)

The false-positive-rate test is the most involved: it inserts 5,000 members, then probes 50,000 distinct non-members, counts hits, and asserts the ratio is below the threshold.

Invariants

1. No false negatives — every inserted item must be found. testnofalsenegatives checks this at scale; testaddandcontains checks it at small scale.

2. FPR ≤ 2× target — the empirical false positive rate must not exceed double the configured rate.

3. Optimal sizing follows textbook formulas — m and k are derived from n and p using the standard formulas, not approximations.

4. Counting filter is reference-counted — add("x") twice then remove("x") once leaves "x" present.

5. Union requires compatible parameters — filters with different bit_size cannot be unioned.

6. Cardinality estimation within 10% — estimate_count() uses the bit-density-based estimator and must be reasonably accurate.

7. Deterministic hashing — identical inputs to identically-configured filters produce identical bit arrays.

Error Handling

ValueError on invalid removal: CountingBloomFilter.remove() raises ValueError when removing an element that was never added (tested by testcountingremove_nonexistent).
ValueError on incompatible union: union() raises ValueError when bit sizes or hash counts differ (tested by testunionincompatible).
No error handling for type mismatches on input — the filter presumably hashes whatever it receives.

Topics to Explore

[file] bloom-filter/bloomfilter.py — The implementation: how hashes generates k hash positions, how the counting variant tracks per-slot counts, and how ScalableBloomFilter chains sub-filters with tightening FPR targets
[function] bloom-filter/bloomfilter.py:BloomFilter.estimatecount — The cardinality estimator likely uses the formula n* = -(m/k)·ln(1 - X/m) where X is set bits; worth understanding its accuracy bounds
[file] bloom-filter/testertestbloomfilter.py — The meta-test wrapper; this repo has a testertest_*.py pattern across all modules that likely validates test quality or coverage
[file] log-structured-merge-tree/lsm.py — LSM trees are the primary consumer of Bloom filters in DDIA; see how the filter integrates with SSTable lookups to avoid unnecessary disk reads
[general] scalable-bloom-filter-slice-growth — How ScalableBloomFilter tightens the FPR for each new slice (typically geometric: p, p·r, p·r², ...) to keep the aggregate FPR bounded

Beliefs

bloom-filter-no-false-negatives — BloomFilter._contains_ never returns False for an item that was previously add()-ed; the test suite verifies this for up to 1000 items
bloom-filter-len-counts-insertions — BloomFilter._len_ counts total add() calls, not distinct items — adding "dup" twice yields len() == 2
counting-bloom-filter-reference-counted — CountingBloomFilter uses reference-counting semantics: an item added N times requires N remove() calls before it tests as absent
bloom-filter-optimal-sizing-textbook — BloomFilter computes bitcount as ⌈-n·ln(p)/ln²(2)⌉ and hashcount as round((m/n)·ln(2)), matching the standard formulas from Bloom (1970)
bloom-filter-deterministic-hashing — The hash function is deterministic (not seeded with randomness), so two filters with identical parameters produce identical bit arrays for identical inputs