Date: 2026-05-29
Time: 07:01
multi-leader-replication/multi_leader.pyThis file implements multi-leader (multi-master) replication as described in Chapter 5 of *Designing Data-Intensive Applications*. It models a cluster where multiple nodes can independently accept writes, then asynchronously replicate changes to each other — the defining characteristic that separates multi-leader from single-leader replication. The file owns conflict detection, conflict resolution, and replication topology.
_TOMBSTONEA sentinel object used as the stored value when a key is deleted. Using object() guarantees identity-uniqueness — no real value can ever is-match it. This is a classic soft-delete pattern: deletes are writes, not erasures, so they can replicate and resolve conflicts like any other mutation.
ConflictStrategy (Enum)Two resolution policies:
LASTWRITEWINS — deterministic tiebreak on (timestamp, node_id) tuple comparison. Simple, but silently drops the losing write.CUSTOMMERGE — delegates to a caller-supplied mergefn(key, localval, remoteval, localts, remotets). This is DDIA's "application-level" resolution — the only strategy that can preserve both sides of a conflict.Topology (Enum)Controls how changes fan out during sync():
ALLTOALL — every node sends to every other node. Convergence in one round.RING — each node sends only to its successor. Requires N-1 rounds to fully propagate through an N-node cluster.ConflictRecord (dataclass)An audit trail entry capturing both sides of a conflict and how it was resolved. The resolvedvalue and resolvedby fields make conflicts inspectable after the fact — critical for debugging in multi-leader systems where silent data loss is the main risk.
ReplicaNodeThe core abstraction. Each node maintains:
| Field | Purpose |
|-------|---------|
| _clock | Lamport scalar clock, incremented on every local write and bumped on every remote apply |
| store | dict[str, (value, timestamp, originnodeid, istombstone)] — the data, with full provenance |
| _pending | Outgoing replication log — changes queued for the next sync() |
| conflictlog | Append-only list of ConflictRecords |
| _seen | Per-key set of (timestamp, origin) pairs for idempotency |
Contract: After put/delete, the change is in store *and* in pending. After getpendingchanges(), the pending list is drained (returned and cleared). This is a consume-once queue.
MultiLeaderClusterOrchestrator that wires nodes together. Provides the replication transport (sync()) and convergence checking (allconverged(), syncuntil_converged()). It separates the *mechanism* of conflict resolution (in ReplicaNode) from the *policy* of when and how to replicate (in MultiLeaderCluster).
Lamport clock for causal ordering. tick() increments before every local write. applyremote_change merges with max(local, remote) + 1. This ensures the clock always advances and establishes a total order over events when combined with node ID as a tiebreaker.
Tombstone-based deletion. Deletes write a TOMBSTONE value with istombstone=True rather than removing the key. This is essential in multi-leader systems — without tombstones, a delete on node A would be indistinguishable from "never existed" when node B tries to replicate, and the value would silently reappear.
Idempotent apply. The _seen set prevents the same change from being applied twice. This matters especially in ring topology, where a change propagates through multiple hops and could revisit a node.
Deterministic tiebreaking. LWW compares (timestamp, node_id) tuples. Since Python tuple comparison is lexicographic, equal timestamps break ties by node ID string comparison. This guarantees all nodes reach the same resolution independently — no coordination needed.
Separate collection and distribution. sync() collects all pending changes *first*, then distributes. This prevents a change from cascading within a single sync round (node A's change reaching B, then B's copy reaching C in the same round).
Imports: Only stdlib — enum.Enum, typing, dataclasses.dataclass. No external dependencies. This is a pure in-memory simulation.
Imported by: testmultileader.py and testertestmulti_leader.py — test suites that exercise the replication and conflict resolution logic.
1. Client calls node.put(key, value) or node.delete(key)
2. Lamport clock ticks → timestamp assigned
3. Entry written to store with (value, ts, self.nodeid, is_tombstone)
4. Change dict appended to _pending
5. (ts, nodeid) recorded in seen
1. cluster.sync() drains _pending from every node
2. Based on topology, each change is delivered to target nodes via applyremotechange
3. applyremotechange:
_seen for idempotency → skip if duplicatemax(local, remote) + 1(ts, node_id) winsmergefn, assign new timestamp max(localts, remotets) + 1_pending for further propagation (ring topology needs this)ConflictRecord if conflict occurredsyncuntilconverged() loops sync() up to maxrounds times, checking allconverged() after each round. Convergence means all nodes agree on the same (value, timestamp, is_tombstone) for every key.
1. Lamport clock monotonicity. clock never decreases. tick() increments by 1; applyremotechange sets max(local, remote) + 1.
2. Idempotent replication. A change with the same (key, timestamp, nodeid) is applied at most once per node, enforced by seen.
3. Tombstones are permanent until overwritten. A delete sets is_tombstone=True; get() returns None for tombstoned keys. Only a subsequent put can revive a key.
4. Pending changes are consumed exactly once. getpendingchanges() returns the list and replaces it with []. No change is sent twice from the same node (though it may be re-queued by a downstream node in ring topology).
5. Custom merge produces a new timestamp. When CUSTOMMERGE resolves a conflict, it creates newts = max(localts, remotets) + 1 and a canonicalorigin = max(localorigin, remote_node). This ensures the merged result supersedes both inputs in any subsequent LWW comparison.
6. ALLTOALL converges in one sync round (assuming no new writes during sync). Ring requires up to N-1 rounds.
Minimal — appropriate for a reference implementation:
applyremotechange raises ValueError for unknown ConflictStrategy values.syncuntilconverged raises RuntimeError if convergence isn't reached within max_rounds.put, get, or delete. Keys and values can be anything hashable/storable.mergefn is called without protection — if it raises, the exception propagates through applyremote_change and sync().multi-leader-replication/testmultileader.py — See how concurrent writes, ring convergence, and custom merge functions are exercised in testsconflict-free-replicated-data-types/crdts.py — CRDTs are the "conflict-free" alternative to the explicit conflict resolution used here; compare the trade-offsvector-clocks/vector_clock.py — Vector clocks detect causality more precisely than the scalar Lamport clock used here; understand what this implementation trades awayleader-follower-replication/replication.py:ReplicaNode — Compare single-leader replication (no conflicts by design) with this multi-leader approachring-topology-propagation — How pending re-queueing in applyremotechange enables multi-hop propagation, and why idempotency via seen is essential to prevent infinite loopsmulti-leader-lww-deterministic-tiebreak — LWW conflict resolution compares (timestamp, node_id) tuples, guaranteeing all nodes independently reach the same winner without coordinationmulti-leader-tombstone-delete — Deletes are implemented as tombstone writes that replicate like normal mutations; get() returns None for tombstoned keysmulti-leader-custom-merge-new-timestamp — Custom merge resolution creates a new timestamp (max(localts, remotets) + 1) so the merged result supersedes both conflicting inputs in any future comparisonmulti-leader-idempotent-apply — Each (key, timestamp, originnode) triple is applied at most once per node, tracked by the seen set, preventing duplicate application in ring topologiesmulti-leader-sync-collects-then-distributes — sync() drains all pending queues before distributing any changes, preventing intra-round cascading where one node's change would trigger further propagation within the same sync round