File: multi-leader-replication/multi_leader.py

Date: 2026-05-29

Time: 07:01

multi-leader-replication/multi_leader.py

Purpose

This file implements multi-leader (multi-master) replication as described in Chapter 5 of *Designing Data-Intensive Applications*. It models a cluster where multiple nodes can independently accept writes, then asynchronously replicate changes to each other — the defining characteristic that separates multi-leader from single-leader replication. The file owns conflict detection, conflict resolution, and replication topology.

Key Components

_TOMBSTONE

A sentinel object used as the stored value when a key is deleted. Using object() guarantees identity-uniqueness — no real value can ever is-match it. This is a classic soft-delete pattern: deletes are writes, not erasures, so they can replicate and resolve conflicts like any other mutation.

ConflictStrategy (Enum)

Two resolution policies:

Topology (Enum)

Controls how changes fan out during sync():

ConflictRecord (dataclass)

An audit trail entry capturing both sides of a conflict and how it was resolved. The resolvedvalue and resolvedby fields make conflicts inspectable after the fact — critical for debugging in multi-leader systems where silent data loss is the main risk.

ReplicaNode

The core abstraction. Each node maintains:

| Field | Purpose |

|-------|---------|

| _clock | Lamport scalar clock, incremented on every local write and bumped on every remote apply |

| store | dict[str, (value, timestamp, originnodeid, istombstone)] — the data, with full provenance |

| _pending | Outgoing replication log — changes queued for the next sync() |

| conflictlog | Append-only list of ConflictRecords |

| _seen | Per-key set of (timestamp, origin) pairs for idempotency |

Contract: After put/delete, the change is in store *and* in pending. After getpendingchanges(), the pending list is drained (returned and cleared). This is a consume-once queue.

MultiLeaderCluster

Orchestrator that wires nodes together. Provides the replication transport (sync()) and convergence checking (allconverged(), syncuntil_converged()). It separates the *mechanism* of conflict resolution (in ReplicaNode) from the *policy* of when and how to replicate (in MultiLeaderCluster).

Patterns

Lamport clock for causal ordering. tick() increments before every local write. applyremote_change merges with max(local, remote) + 1. This ensures the clock always advances and establishes a total order over events when combined with node ID as a tiebreaker.

Tombstone-based deletion. Deletes write a TOMBSTONE value with istombstone=True rather than removing the key. This is essential in multi-leader systems — without tombstones, a delete on node A would be indistinguishable from "never existed" when node B tries to replicate, and the value would silently reappear.

Idempotent apply. The _seen set prevents the same change from being applied twice. This matters especially in ring topology, where a change propagates through multiple hops and could revisit a node.

Deterministic tiebreaking. LWW compares (timestamp, node_id) tuples. Since Python tuple comparison is lexicographic, equal timestamps break ties by node ID string comparison. This guarantees all nodes reach the same resolution independently — no coordination needed.

Separate collection and distribution. sync() collects all pending changes *first*, then distributes. This prevents a change from cascading within a single sync round (node A's change reaching B, then B's copy reaching C in the same round).

Dependencies

Imports: Only stdlib — enum.Enum, typing, dataclasses.dataclass. No external dependencies. This is a pure in-memory simulation.

Imported by: testmultileader.py and testertestmulti_leader.py — test suites that exercise the replication and conflict resolution logic.

Flow

Write path

1. Client calls node.put(key, value) or node.delete(key)

2. Lamport clock ticks → timestamp assigned

3. Entry written to store with (value, ts, self.nodeid, is_tombstone)

4. Change dict appended to _pending

5. (ts, nodeid) recorded in seen

Replication path

1. cluster.sync() drains _pending from every node

2. Based on topology, each change is delivered to target nodes via applyremotechange

3. applyremotechange:

Convergence path

syncuntilconverged() loops sync() up to maxrounds times, checking allconverged() after each round. Convergence means all nodes agree on the same (value, timestamp, is_tombstone) for every key.

Invariants

1. Lamport clock monotonicity. clock never decreases. tick() increments by 1; applyremotechange sets max(local, remote) + 1.

2. Idempotent replication. A change with the same (key, timestamp, nodeid) is applied at most once per node, enforced by seen.

3. Tombstones are permanent until overwritten. A delete sets is_tombstone=True; get() returns None for tombstoned keys. Only a subsequent put can revive a key.

4. Pending changes are consumed exactly once. getpendingchanges() returns the list and replaces it with []. No change is sent twice from the same node (though it may be re-queued by a downstream node in ring topology).

5. Custom merge produces a new timestamp. When CUSTOMMERGE resolves a conflict, it creates newts = max(localts, remotets) + 1 and a canonicalorigin = max(localorigin, remote_node). This ensures the merged result supersedes both inputs in any subsequent LWW comparison.

6. ALLTOALL converges in one sync round (assuming no new writes during sync). Ring requires up to N-1 rounds.

Error Handling

Minimal — appropriate for a reference implementation:

Topics to Explore

Beliefs