Date: 2026-05-29
Time: 09:07
total-order-broadcast/totalorderbroadcast.pyThis file implements Total Order Broadcast (TOB) — a distributed systems primitive that guarantees all nodes deliver messages in the same order. It's a reference implementation for Chapter 9 of *Designing Data-Intensive Applications*, which discusses consensus and total order broadcast as equivalent problems.
The module builds TOB on top of Multi-Paxos: each message slot is decided by an independent single-decree Paxos instance. It then demonstrates the practical payoff by building a LinearizableRegister (a linearizable key-value store) on top of the broadcast — the canonical reduction from consensus to linearizability.
ConsensusInstance (lines 4–50)A single-decree Paxos acceptor/learner for one slot in the log. Each instance tracks:
promised: highest proposal number this acceptor has promised not to accept anything lower than (Phase 1 state).acceptedproposal / acceptedvalue: the highest-numbered proposal this acceptor has accepted (Phase 2 state).decided / decided_value: whether a value has been chosen for this slot.Key contracts:
prepare(proposalnumber, proposerid) → returns {promised: bool, acceptedproposal, acceptedvalue}. Promises only if proposal_number > self.promised.accept(proposalnumber, value, proposerid) → returns {accepted: bool}. Accepts only if proposal_number >= self.promised.force_decide(value) → unconditionally marks the slot as decided (used for catch-up, not consensus).TOBNode (lines 53–278)The core protocol participant. Each node is simultaneously a Paxos proposer, acceptor, and learner for every slot. Key state:
_instances: map of slot → ConsensusInstance (created lazily)._pending: FIFO queue of messages waiting to be proposed._proposals: active Paxos rounds, keyed by slot. Each tracks round number, phase (prepare/accept/done), promise/accept vote counts, and the highest previously-accepted value (for Paxos's value-adoption rule)._delivered: ordered list of (slot, message) pairs — the final total-ordered log.nextslot: the frontier of contiguous delivery. Messages are only delivered when all prior slots are decided.Key methods:
| Method | Role |
|--------|------|
| broadcast(message) | Enqueues a message for ordering |
| tick() | Drives the protocol: delivers decided slots, starts proposals for pending messages, retries preempted proposals |
| receive(msg) | Dispatches incoming protocol messages to handlers |
| makeproposalnumber(roundnum) | Generates globally unique proposal numbers: round * N + node_id |
| findproposal_slot() | Finds lowest undecided, uncontested slot |
| bumpand_retry(slot, outgoing) | Increments the round and restarts Phase 1 after preemption |
| crash() / recover(decided_slots) | Simulates node failure and state-transfer recovery |
TOBCluster (lines 281–353)A simulated network that owns all nodes and routes messages synchronously. It provides the test harness:
rununtildelivered(expectedcount, maxrounds) — runs tick/route cycles until all alive nodes have delivered enough messages.routemessages(messages, depth) — recursively delivers messages and collects responses within a single round, with a depth limit of 500 to prevent stack overflow.verifytotalorder() — asserts all alive nodes have identical delivery sequences.LinearizableRegister (lines 356–432)Demonstrates the reduction from TOB to linearizability. Each operation (read, write, compareandset) is broadcast as a message, and the result is determined by the global delivery order. This is the key insight from DDIA: if you can order all operations consistently, you get linearizability for free.
current == expected at delivery time, not submission time.opresults keyed by opid and retrieved after rununtil_delivered returns.Multi-Paxos via single-decree composition: Rather than implementing Multi-Paxos directly with a stable leader, each slot runs independent Paxos. This is simpler but less efficient — every message triggers a full two-phase protocol.
Proposal number encoding: round * numnodes + nodeid guarantees uniqueness across nodes without coordination. Node 0's proposals are 0, 3, 6, ...; node 1's are 1, 4, 7, ... (for a 3-node cluster).
Value adoption: In handleprepare_response, if any acceptor reports a previously accepted value, the proposer *must* adopt the highest-numbered one — this is the core Paxos safety invariant that prevents decided values from being overwritten.
Re-queuing on preemption: When a proposer's value loses a slot (another value was decided), the original value is pushed back onto _pending to be proposed in a later slot. This ensures no messages are lost.
Contiguous delivery: deliverdecidedslots only advances next_slot through a contiguous run of decided slots. A gap (undecided slot 5 when 6 is decided) blocks delivery of slot 6 — this is what guarantees total order.
Synchronous simulation: TOBCluster.routemessages is recursive — it delivers all messages produced in one round before returning, simulating synchronous message passing. The depth limit prevents infinite loops during contention.
Imports: None — the module is self-contained with no external dependencies.
Imported by:
testtotalorder_broadcast.py — unit/integration teststestertesttotalorderbroadcast.py — likely a meta-test or test validatorA typical broadcast follows this path:
1. Client calls node.broadcast("msg") → message enters _pending queue.
2. tick() pops the message, finds an open slot via findproposal_slot(), starts Phase 1 by sending prepare to all nodes.
3. Each node's receive() handles the prepare → acceptor calls inst.prepare() → sends prepare_response.
4. Proposer collects promises in handleprepareresponse. On majority: adopts any previously-accepted value, transitions to Phase 2, sends acceptrequest to all.
5. Each node's acceptor handles accept → inst.accept() → sends accept_response.
6. Proposer collects accepts in handleacceptresponse. On majority: marks instance decided, broadcasts decided to all, calls deliverdecidedslots().
7. All nodes receive decided → force-decide the instance, deliver contiguous slots, re-queue any displaced proposals.
Contention path: If a prepare or accept is rejected (a higher proposal number was seen), the proposer counts rejections. Once a majority is impossible, bumpand_retry increments the round and restarts Phase 1.
nextslot advances only through consecutive decided slots. A hole blocks everything behind it.round * N + node_id encoding guarantees no two nodes generate the same proposal number._pending.N/2 + 1 votes. Two majorities always overlap, ensuring any decided value is visible to future proposers.There is essentially none — this is a protocol simulation, not production code:
TOBCluster.routemessages.crash() sets _alive = False, making the node drop all messages. recover() replays decided slots from a peer — there's no persistent storage or WAL.routemessages depth limit: Caps recursion at 500 to prevent stack overflow during heavy contention, but remaining undelivered messages are silently deferred to the next tick rather than raising.rununtildelivered timeout: Returns False if max_rounds is exhausted — callers must check the return value.