File: gossip-protocol/gossip_protocol.py

Date: 2026-05-29

Time: 08:55

gossip-protocol/gossip_protocol.py

Purpose

This file implements a gossip-based failure detection and membership protocol, the kind described in DDIA Chapter 8 (distributed systems) and used by systems like Cassandra and Riak. It owns two responsibilities: (1) maintaining a consistent cluster membership view across nodes via epidemic-style information dissemination, and (2) detecting node failures through heartbeat timeouts with a three-state lifecycle (alive → suspected → dead → removed).

Key Components

GossipNode

A single participant in the gossip cluster. Each node maintains its own local copy of the full membership list — there is no shared state.

Constructor accepts three timeout thresholds that govern the failure detection state machine:

Core methods:

| Method | Contract |

|--------|----------|

| join(seed_node) | Bootstraps membership by copying the seed's list and registering self with the seed. Mutates both nodes. |

| leave() | Voluntary departure — marks self dead and sets _leaving flag so the cluster can broadcast the death before deactivation. |

| heartbeat(current_time) | Increments own counter and updates timestamp. No-ops if already dead. |

| send_gossip() | Returns a deep copy of the membership list (safe to hand to another node). |

| receivegossip(membershiplist, current_time) | Merge logic — the heart of the protocol. |

| detectfailures(currenttime) | Applies timeout-based state transitions and garbage-collects old dead entries. Returns a dict of status changes. |

GossipCluster

A simulation harness that orchestrates multiple GossipNode instances. Not a distributed runtime — it's a deterministic, single-process simulator with a seeded RNG for reproducible gossip partner selection.

Key methods:

| Method | Contract |

|--------|----------|

| addnode(nodeid) | Creates a node and joins it to any existing alive node. Returns the node. |

| removenode(nodeid) | Simulates a crash — no leave message, node just stops heartbeating. |

| gossipround(currenttime) | One tick: handle leavers, heartbeat all active nodes, random pairwise exchange, then failure detection. |

| runrounds(numrounds, start_time) | Runs multiple rounds, collecting per-node membership snapshots at each tick. |

Patterns

Crux of the protocol — receive_gossip merge rule. The merge uses heartbeat counter monotonicity as the conflict resolution mechanism. A remote entry is accepted only when its heartbeat counter exceeds the local counter, which prevents stale information from overwriting fresh state. There's a special case: a dead status with an *equal* counter is also accepted, ensuring voluntary leave notifications propagate even without a heartbeat bump.

Deep-copy isolation. Every boundary crossing (sendgossip, join, getmembership_list) uses copy.deepcopy to prevent aliased mutation between nodes. This is critical for simulation correctness — without it, all nodes would share the same dict objects.

Two-phase voluntary leave. leave() doesn't immediately deactivate the node. It sets leaving = True, and gossipround broadcasts the leave message to all peers *before* setting _alive = False. This ensures at least one round of propagation.

Deterministic simulation. GossipCluster takes an optional seed for random.Random, making test scenarios reproducible despite the random peer selection.

Dependencies

Imports: Only stdlib — copy for isolation and random for peer selection. No external dependencies.

Imported by: Test files (testgossipprotocol.py, testertestgossip_protocol.py) exercise both GossipNode directly and GossipCluster for integration scenarios.

Flow

A typical simulation round (gossip_round) proceeds in four phases:

1. Leave broadcast — Any node with _leaving=True sends its membership (containing its own dead status) to every active peer, then deactivates.

2. Heartbeat — Each active node increments its own counter.

3. Pairwise exchange — Each active node picks one random peer. They exchange full membership lists bidirectionally. This is the epidemic spread mechanism — information propagates exponentially across the cluster.

4. Failure detection — Every active node scans its membership list and applies timeout-based transitions: alive → suspected (after tsuspect), suspected → dead (after tdead), dead → removed (after t_cleanup).

Information convergence relies on the random pairwise exchanges accumulating across rounds. After O(log N) rounds, all nodes will typically have consistent views.

Invariants

Error Handling

There is essentially none — this is a simulation, not production code. No exceptions are raised or caught. Invalid inputs (e.g., joining a cluster with a crashed seed node, adding a duplicate node ID) would produce incorrect state silently rather than failing loudly. The code trusts that GossipCluster drives the simulation correctly.

Topics to Explore

Beliefs