File: raft-consensus/raft.py

Date: 2026-05-29

Time: 09:02

raft-consensus/raft.py — Raft Consensus Algorithm Simulation

Purpose

This file is a self-contained, single-threaded simulation of the Raft consensus algorithm as described in the Ongaro & Ousterhout paper (and Chapter 9 of DDIA). It owns the core protocol logic — leader election, log replication, and commit advancement — plus a RaftCluster harness that simulates message passing and network partitions without real networking. The simulation is tick-driven: time advances in discrete steps, making the behavior deterministic enough for testing.

Key Components

LogEntry

A simple value object holding term, index, and command. No behavior — it's just a record carried in the replicated log.

RaftNode

The core state machine for a single Raft participant. Its state is organized into three categories that mirror the paper:

Key methods and their contracts:

| Method | Role | Returns |

|--------|------|---------|

| tick(elapsed_ms) | Advances timers; fires elections or heartbeats | List of outbound messages (dicts) |

| handlerequestvote(...) | Processes a vote request from a candidate | {term, vote_granted} |

| handleappendentries(...) | Processes a log replication / heartbeat RPC | {term, success, match_index} |

| handlevoteresponse(...) | Candidate tallies a vote reply | None (side effects only) |

| handleappendresponse(...) | Leader updates replication tracking | None (side effects only) |

| client_request(command) | Appends a command to the leader's log | {success, entry, error} |

| getcommittedentries() | Returns log entries at or below commitindex | List of LogEntry (excludes sentinel) |

RaftCluster

A synchronous test harness that wires nodes together. It owns message dispatch and simulates network partitions via _partitioned. Key methods:

Patterns

Tick-driven simulation. Instead of real timers or async I/O, time advances via explicit tick() calls. This makes tests deterministic (modulo random.randint for election timeouts) and avoids threading complexity.

Message-passing via dicts. Outbound messages are plain dicts with a type field ("requestvote" or "appendentries"). The cluster dispatches based on type and delivers responses inline. This is a simplified version of the RPC layer a real implementation would have.

State machine transitions. The three states (follower, candidate, leader) are managed by becomefollower, becomecandidate, and becomeleader. Every transition resets the relevant timers and volatile state. The transitions follow the Raft paper exactly: follower → candidate (election timeout), candidate → leader (majority vote), any → follower (higher term seen).

Sentinel log entry. The log is initialized with a dummy entry at index 0, term 0. This eliminates boundary checks — lastlogindex() and lastlogterm() always have a valid entry to read, and prevlogindex=0 always matches.

Dependencies

No external libraries. The entire protocol is self-contained.

Flow

Leader Election

1. A follower's tick() increments electiontimer. When it exceeds the randomized electiontimeout, the node calls becomecandidate() — increments term, votes for itself, and broadcasts request_vote messages.

2. Each recipient calls handlerequestvote(). It grants a vote if: (a) it hasn't voted this term (or already voted for this candidate), and (b) the candidate's log is at least as up-to-date (islogupto_date).

3. The cluster delivers the response back to the candidate via handlevoteresponse(). If the candidate accumulates a strict majority, it calls becomeleader().

Log Replication

1. A client calls client_request(command) on the leader, which appends a LogEntry to its local log.

2. On the next heartbeat tick, the leader sends appendentries to each peer, computed by makeappendentries(). The prevlogindex and prevlogterm fields allow the follower to verify log consistency.

3. The follower's handleappendentries() checks consistency. On mismatch, it truncates its log and returns success=False with its current matchindex so the leader can back up nextindex and retry.

4. On success, the leader updates matchindex and calls advancecommit_index(), which scans for the highest log index replicated to a majority in the current term.

Commit Advancement

advancecommitindex iterates from commitindex + 1 upward. For each index, it counts how many peers have matchindex >= n (plus the leader itself). If a majority exists and the entry's term equals currentterm, commitindex advances. The current-term check is the Raft safety property — leaders never commit entries from previous terms by counting replicas alone.

Invariants

1. Election safety. A node grants at most one vote per term — enforced by votedfor being checked in handlerequestvote and set atomically with the grant.

2. Log matching. If two logs contain an entry with the same index and term, all preceding entries are identical. Enforced by the prevlogindex/prevlogterm consistency check in handleappendentries and truncation on mismatch.

3. Leader completeness. A candidate's log must be at least as up-to-date as the voter's log to win election — islogupto_date compares term first, then index.

4. Current-term commit only. advancecommitindex skips entries where self.log[n].term != self.currentterm, preventing the "commit from a previous term" safety violation (Raft paper §5.4.2).

5. Step-down on higher term. Every RPC handler checks if term > self.currentterm: self.becomefollower(term). This is the universal "if you see a bigger term, you're stale" rule.

6. Sentinel log entry. Index 0 is always present and never removed, so lastlogindex() and lastlogterm() never fail on an empty log.

Error Handling

Errors are returned as data, never raised as exceptions:

There is no validation of inputs (e.g., no check that candidate_term is a positive integer). The code trusts its callers — appropriate for a simulation where the only caller is the cluster harness and the test suite.

Topics to Explore

Beliefs