Original source: topic-phi-accrual-failure-detector

Topic: How production systems (Cassandra) replace the fixed-threshold approach here with a probabilistic one that adapts to network conditions

Date: 2026-05-29

Time: 13:07

From Fixed Thresholds to Probabilistic Failure Detection

The Fixed-Threshold Approach in This Codebase

The gossip protocol implementation uses a simple three-threshold state machine for failure detection. Look at gossip-protocol/gossip_protocol.py:10:


def __init__(self, node_id: str, t_suspect: int = 5, t_dead: int = 10, t_cleanup: int = 20):

These three constants define rigid time boundaries. The actual decision logic lives in detect_failures at line 76, which compares elapsed time since the last heartbeat update against these fixed cutoffs:


elif elapsed > self.t_dead and info["status"] != "dead":       # line 93
    info["status"] = "dead"
elif elapsed > self.t_suspect and info["status"] == "alive":   # line 96
    info["status"] = "suspected"

The state transitions are deterministic: alive → suspected → dead → removed. Every node uses identical thresholds (t_suspect=5, t_dead=10, t_cleanup=20), and these never change at runtime. You can see this rigidity propagated through GossipCluster.__init__ at line 117, where the same values are stamped onto every node created via add_node at line 127.

Why Fixed Thresholds Break in Production

This approach has a fundamental problem: it can't distinguish between "the network is slow right now" and "that node is actually dead."

Consider a node that stalls in a long GC pause, or a network hiccup that adds 2 seconds of latency. With t_suspect=5 and heartbeats that normally arrive every 1 second, the extra 2 seconds leaves only about 3 seconds of slack, so a perfectly healthy node is marked suspected after roughly 3 further missed heartbeats, even though the network will recover in moments. Conversely, on a fast, stable network, waiting a full 5 seconds before suspecting a node wastes time you could spend re-routing traffic.
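Both failure modes can be reproduced with a few lines. This is a minimal sketch, not the code in gossip_protocol.py: the classify function is hypothetical, it ignores the current-status checks and the cleanup stage, and it assumes elapsed time is measured in seconds.

```python
def classify(elapsed: float, t_suspect: float = 5, t_dead: float = 10) -> str:
    """Simplified fixed-threshold check on time since the last heartbeat.

    Hypothetical helper for illustration; the real detect_failures also
    looks at the peer's current status before changing it.
    """
    if elapsed > t_dead:
        return "dead"
    if elapsed > t_suspect:
        return "suspected"
    return "alive"


# A healthy node that pauses for 6 seconds is flagged: a false positive.
print(classify(6.0))

# A crashed node that used to send a heartbeat every second has already
# missed 4 in a row, yet it still counts as alive until the cutoff passes.
print(classify(4.9))
```

The cutoff is the same in both cases, so tuning it to fix one problem makes the other worse.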

The tests in gossip-protocol/testertestgossipprotocol.py:20 confirm this brittle coupling: they must hardcode specific threshold values (t_suspect=3, t_dead=6, t_cleanup=12) and then carefully count rounds to hit the exact transition points.

How Cassandra Solves This: The Phi Accrual Failure Detector

Cassandra replaces the binary "alive or suspected" check with a continuous suspicion level called phi (φ). Instead of asking "has it been longer than t_suspect?", it asks "given the historical distribution of inter-arrival times for this node's heartbeats, how surprising is the current silence?"

The key differences:

1. It maintains a sliding window of heartbeat inter-arrival times per peer. Where this implementation tracks only the latest timestamp (timestamp_last_updated at line 16) and a monotonic counter (heartbeat_counter), Cassandra keeps a bounded window of recent inter-arrival intervals (up to 1000 samples) and models them with an exponential distribution. The original phi accrual paper used a normal distribution instead.

2. Suspicion is a continuous value, not a binary state. The phi value is calculated as:


φ = -log₁₀(1 - CDF(timeSinceLastHeartbeat))

Where CDF is the cumulative distribution function of the observed inter-arrival times. A φ of 1 means there's a 10% chance this silence is normal. A φ of 3 means 0.1%. A φ of 8 means the probability of a healthy node being this quiet is one in a hundred million.

3. The effective timeout adapts automatically. The φ cutoff itself stays fixed, but the silence needed to reach it depends on the peer. A node with jittery heartbeats (high variance) gets a wider distribution, so the same silence duration produces a lower φ. A node with rock-steady heartbeats produces a high φ quickly when it goes quiet, because silence from that node is genuinely surprising.

4. Different callers can use different conviction thresholds. Because φ is a score rather than a verdict, the gossip layer might only convict a peer at φ > 8 (very conservative), while a request router might start steering reads elsewhere at φ > 5. In the fixed-threshold model, there's only one t_suspect for everyone.
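The formula and the adaptivity claim in the list above can be checked numerically. The following is a sketch under stated assumptions: the function names are hypothetical, the two models are the textbook exponential and normal distributions, and none of this is copied from Cassandra's source.

```python
import math


def phi_exponential(elapsed: float, mean: float) -> float:
    """phi when inter-arrival times are modeled as exponential with this mean."""
    # 1 - CDF(t) = exp(-t / mean), so -log10 of it is t / (mean * ln 10).
    return elapsed / (mean * math.log(10))


def phi_normal(elapsed: float, mean: float, stddev: float) -> float:
    """phi when inter-arrival times are modeled as normal(mean, stddev)."""
    # erfc gives the upper tail (1 - CDF) directly, which stays accurate
    # far out in the tail where 1 - CDF would round to zero.
    tail = 0.5 * math.erfc((elapsed - mean) / (stddev * math.sqrt(2)))
    return -math.log10(max(tail, 1e-300))  # floor avoids log10(0)


def chance_silence_is_normal(phi: float) -> float:
    """Invert the formula: probability that a healthy peer is this quiet."""
    return 10.0 ** -phi


# The same 3-second silence, for two peers whose heartbeats average 1 second apart.
steady = phi_normal(3.0, mean=1.0, stddev=0.1)   # rock-steady peer
jittery = phi_normal(3.0, mean=1.0, stddev=1.0)  # jittery peer

print(f"steady peer:  phi = {steady:.1f}")
print(f"jittery peer: phi = {jittery:.2f}")

# Two callers reading the same score with different conviction levels.
for name, threshold in (("router", 5.0), ("gossip", 8.0)):
    print(name, "acts on steady peer:", steady > threshold,
          "| acts on jittery peer:", jittery > threshold)
```

The steady peer's φ lands far above 8 while the jittery peer's stays below 2, even though both have been silent for the same 3 seconds.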

What this would look like replacing `detect_failures`

Instead of the current logic at lines 86–98 where elapsed > self.t_suspect triggers a state change, a phi-accrual detector would:

1. Record every heartbeat arrival time in a bounded window (not just overwrite timestamp_last_updated)

2. Compute mean and variance of inter-arrival intervals

3. Calculate φ from the elapsed time and that distribution

4. Let the caller decide what φ level warrants action
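The four steps above can be sketched as a small per-peer class. Everything here is an assumption made for illustration: PhiAccrualDetector, its method names, the normal-distribution model, and the min_stddev floor are hypothetical and do not exist in gossip_protocol.py.

```python
import math
from collections import deque
from statistics import fmean, pstdev


class PhiAccrualDetector:
    """Hypothetical per-peer detector following the four steps above."""

    def __init__(self, window_size: int = 1000, min_stddev: float = 0.1):
        # Step 1: bounded history; old intervals fall off the left end.
        self.intervals = deque(maxlen=window_size)
        self.last_arrival = None
        # Floor on the spread so a perfectly regular peer does not
        # produce a zero standard deviation (assumed value).
        self.min_stddev = min_stddev

    def heartbeat(self, now: float) -> None:
        """Record an arrival instead of only overwriting a timestamp."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now: float) -> float:
        """Suspicion level for the current silence."""
        if len(self.intervals) < 2:
            return 0.0  # too little history to judge
        # Step 2: mean and spread of the observed intervals.
        mean = fmean(self.intervals)
        stddev = max(pstdev(self.intervals), self.min_stddev)
        # Step 3: phi = -log10(1 - CDF(elapsed)) under a normal model.
        elapsed = now - self.last_arrival
        tail = 0.5 * math.erfc((elapsed - mean) / (stddev * math.sqrt(2)))
        return -math.log10(max(tail, 1e-300))


detector = PhiAccrualDetector()
for t in range(11):            # heartbeats at t = 0, 1, ..., 10
    detector.heartbeat(float(t))

# Step 4: the caller picks the level that warrants action.
for now in (10.5, 13.0):
    status = "convict" if detector.phi(now) > 8 else "ok"
    print(f"t={now}: {status}")
```

Half a second after the last heartbeat the score is near zero. Three seconds after it, a peer that has never been late scores well above 8.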

The receive_gossip method (line 52) would need to feed arrival times into the distribution rather than just updating timestamp_last_updated at line 62. The three fixed thresholds (t_suspect, t_dead, t_cleanup) would collapse into a single configurable conviction level (e.g., phi_convict_threshold = 8), but the actual sensitivity would be driven by observed network behavior.
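A rough sketch of that wiring follows. It assumes a hypothetical PhiMembership table standing in for the node's peer state, uses the simpler exponential model (φ = elapsed / (mean × ln 10)), and does not reproduce the real receive_gossip signature.

```python
import math
from collections import defaultdict, deque

PHI_CONVICT_THRESHOLD = 8.0  # single conviction level, as in the example above


class PhiMembership:
    """Hypothetical stand-in for a node's peer table; all names are illustrative."""

    def __init__(self, window_size: int = 1000):
        self.last_seen = {}
        self.intervals = defaultdict(lambda: deque(maxlen=window_size))

    def on_heartbeat(self, peer_id: str, now: float) -> None:
        # Called from the gossip receive path: besides refreshing the
        # timestamp, remember the gap since the previous arrival.
        if peer_id in self.last_seen:
            self.intervals[peer_id].append(now - self.last_seen[peer_id])
        self.last_seen[peer_id] = now

    def phi(self, peer_id: str, now: float) -> float:
        window = self.intervals[peer_id]
        if not window:
            return 0.0  # no history yet
        mean = sum(window) / len(window)
        if mean <= 0:
            return 0.0  # degenerate history; nothing to compare against
        return (now - self.last_seen[peer_id]) / (mean * math.log(10))

    def convicted(self, now: float, threshold: float = PHI_CONVICT_THRESHOLD) -> list:
        """Peers whose current silence exceeds the conviction level."""
        return sorted(p for p in self.last_seen if self.phi(p, now) > threshold)


table = PhiMembership()
for t in range(11):                      # peer "a" gossips every second
    table.on_heartbeat("a", float(t))
for t in (0.0, 5.0, 10.0):               # peer "b" gossips every 5 seconds
    table.on_heartbeat("b", t)

# Both peers have been silent for 20 seconds, but only "a" is convicted:
# 20 seconds is far outside its history and unremarkable for "b".
print(table.convicted(now=30.0))
```

The threshold is a single number, yet the silence it tolerates differs per peer because each peer's own history sets the scale.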

What this codebase is missing to implement it

This implementation has no concept of inter-arrival time history — it only stores the most recent timestamp. There's no statistical modeling, no per-peer distribution tracking, and no continuous suspicion score. The entire failure detection path is a simple elapsed-time comparison against constants. That's the pedagogical point: this code teaches the protocol structure (gossip exchange, state machine, membership management), while Cassandra layers statistical sophistication on top of that same structure.