File: total-order-broadcast/totalorderbroadcast.py

Date: 2026-05-29

Time: 09:07

total-order-broadcast/totalorderbroadcast.py

Purpose

This file implements Total Order Broadcast (TOB) — a distributed systems primitive that guarantees all nodes deliver messages in the same order. It's a reference implementation for Chapter 9 of *Designing Data-Intensive Applications*, which discusses consensus and total order broadcast as equivalent problems.

The module builds TOB on top of Multi-Paxos: each message slot is decided by an independent single-decree Paxos instance. It then demonstrates the practical payoff by building a LinearizableRegister (a linearizable key-value store) on top of the broadcast — the canonical reduction from consensus to linearizability.

Key Components

ConsensusInstance (lines 4–50)

A single-decree Paxos acceptor/learner for one slot in the log. Each instance tracks:

Key contracts:

TOBNode (lines 53–278)

The core protocol participant. Each node is simultaneously a Paxos proposer, acceptor, and learner for every slot. Key state:

Key methods:

| Method | Role |

|--------|------|

| broadcast(message) | Enqueues a message for ordering |

| tick() | Drives the protocol: delivers decided slots, starts proposals for pending messages, retries preempted proposals |

| receive(msg) | Dispatches incoming protocol messages to handlers |

| makeproposalnumber(roundnum) | Generates globally unique proposal numbers: round * N + node_id |

| findproposal_slot() | Finds lowest undecided, uncontested slot |

| bumpand_retry(slot, outgoing) | Increments the round and restarts Phase 1 after preemption |

| crash() / recover(decided_slots) | Simulates node failure and state-transfer recovery |

TOBCluster (lines 281–353)

A simulated network that owns all nodes and routes messages synchronously. It provides the test harness:

LinearizableRegister (lines 356–432)

Demonstrates the reduction from TOB to linearizability. Each operation (read, write, compareandset) is broadcast as a message, and the result is determined by the global delivery order. This is the key insight from DDIA: if you can order all operations consistently, you get linearizability for free.

Patterns

Multi-Paxos via single-decree composition: Rather than implementing Multi-Paxos directly with a stable leader, each slot runs independent Paxos. This is simpler but less efficient — every message triggers a full two-phase protocol.

Proposal number encoding: round * numnodes + nodeid guarantees uniqueness across nodes without coordination. Node 0's proposals are 0, 3, 6, ...; node 1's are 1, 4, 7, ... (for a 3-node cluster).

Value adoption: In handleprepare_response, if any acceptor reports a previously accepted value, the proposer *must* adopt the highest-numbered one — this is the core Paxos safety invariant that prevents decided values from being overwritten.

Re-queuing on preemption: When a proposer's value loses a slot (another value was decided), the original value is pushed back onto _pending to be proposed in a later slot. This ensures no messages are lost.

Contiguous delivery: deliverdecidedslots only advances next_slot through a contiguous run of decided slots. A gap (undecided slot 5 when 6 is decided) blocks delivery of slot 6 — this is what guarantees total order.

Synchronous simulation: TOBCluster.routemessages is recursive — it delivers all messages produced in one round before returning, simulating synchronous message passing. The depth limit prevents infinite loops during contention.

Dependencies

Imports: None — the module is self-contained with no external dependencies.

Imported by:

Flow

A typical broadcast follows this path:

1. Client calls node.broadcast("msg") → message enters _pending queue.

2. tick() pops the message, finds an open slot via findproposal_slot(), starts Phase 1 by sending prepare to all nodes.

3. Each node's receive() handles the prepare → acceptor calls inst.prepare() → sends prepare_response.

4. Proposer collects promises in handleprepareresponse. On majority: adopts any previously-accepted value, transitions to Phase 2, sends acceptrequest to all.

5. Each node's acceptor handles accept → inst.accept() → sends accept_response.

6. Proposer collects accepts in handleacceptresponse. On majority: marks instance decided, broadcasts decided to all, calls deliverdecidedslots().

7. All nodes receive decided → force-decide the instance, deliver contiguous slots, re-queue any displaced proposals.

Contention path: If a prepare or accept is rejected (a higher proposal number was seen), the proposer counts rejections. Once a majority is impossible, bumpand_retry increments the round and restarts Phase 1.

Invariants

Error Handling

There is essentially none — this is a protocol simulation, not production code: