Original source: multi-leader-replication-multi_leader

File: multi-leader-replication/multi_leader.py

Date: 2026-05-29

Time: 07:01

`multi-leader-replication/multi_leader.py`

Purpose

This file implements multi-leader (multi-master) replication as described in Chapter 5 of *Designing Data-Intensive Applications*. It models a cluster where multiple nodes can independently accept writes, then asynchronously replicate changes to each other — the defining characteristic that separates multi-leader from single-leader replication. The file owns conflict detection, conflict resolution, and replication topology.

Key Components

`_TOMBSTONE`

A sentinel object used as the stored value when a key is deleted. Using object() guarantees identity-uniqueness — no real value can ever is-match it. This is a classic soft-delete pattern: deletes are writes, not erasures, so they can replicate and resolve conflicts like any other mutation.

`ConflictStrategy` (Enum)

Two resolution policies:

LASTWRITEWINS — deterministic tiebreak on (timestamp, node_id) tuple comparison. Simple, but silently drops the losing write.
CUSTOMMERGE — delegates to a caller-supplied mergefn(key, localval, remoteval, localts, remotets). This is DDIA's "application-level" resolution — the only strategy that can preserve both sides of a conflict.

`Topology` (Enum)

Controls how changes fan out during sync():

ALLTOALL — every node sends to every other node. Convergence in one round.
RING — each node sends only to its successor. Requires N-1 rounds to fully propagate through an N-node cluster.

`ConflictRecord` (dataclass)

An audit trail entry capturing both sides of a conflict and how it was resolved. The resolvedvalue and resolvedby fields make conflicts inspectable after the fact — critical for debugging in multi-leader systems where silent data loss is the main risk.

`ReplicaNode`

The core abstraction. Each node maintains:

| Field | Purpose |

|-------|---------|

| _clock | Lamport scalar clock, incremented on every local write and bumped on every remote apply |

| store | dict[str, (value, timestamp, originnodeid, istombstone)] — the data, with full provenance |

| _pending | Outgoing replication log — changes queued for the next sync() |

| conflictlog | Append-only list of ConflictRecords |

| _seen | Per-key set of (timestamp, origin) pairs for idempotency |

Contract: After put/delete, the change is in store *and* in pending. After getpendingchanges(), the pending list is drained (returned and cleared). This is a consume-once queue.

`MultiLeaderCluster`

Orchestrator that wires nodes together. Provides the replication transport (sync()) and convergence checking (allconverged(), syncuntil_converged()). It separates the *mechanism* of conflict resolution (in ReplicaNode) from the *policy* of when and how to replicate (in MultiLeaderCluster).

Patterns

Lamport clock for causal ordering. tick() increments before every local write. applyremote_change merges with max(local, remote) + 1. This ensures the clock always advances and establishes a total order over events when combined with node ID as a tiebreaker.

Tombstone-based deletion. Deletes write a TOMBSTONE value with istombstone=True rather than removing the key. This is essential in multi-leader systems — without tombstones, a delete on node A would be indistinguishable from "never existed" when node B tries to replicate, and the value would silently reappear.

Idempotent apply. The _seen set prevents the same change from being applied twice. This matters especially in ring topology, where a change propagates through multiple hops and could revisit a node.

Deterministic tiebreaking. LWW compares (timestamp, node_id) tuples. Since Python tuple comparison is lexicographic, equal timestamps break ties by node ID string comparison. This guarantees all nodes reach the same resolution independently — no coordination needed.

Separate collection and distribution. sync() collects all pending changes *first*, then distributes. This prevents a change from cascading within a single sync round (node A's change reaching B, then B's copy reaching C in the same round).

Dependencies

Imports: Only stdlib — enum.Enum, typing, dataclasses.dataclass. No external dependencies. This is a pure in-memory simulation.

Imported by: testmultileader.py and testertestmulti_leader.py — test suites that exercise the replication and conflict resolution logic.

Flow

Write path

1. Client calls node.put(key, value) or node.delete(key)

2. Lamport clock ticks → timestamp assigned

3. Entry written to store with (value, ts, self.nodeid, is_tombstone)

4. Change dict appended to _pending

5. (ts, nodeid) recorded in seen

Replication path

1. cluster.sync() drains _pending from every node

2. Based on topology, each change is delivered to target nodes via applyremotechange

3. applyremotechange:

Checks _seen for idempotency → skip if duplicate
Merges Lamport clock: max(local, remote) + 1
If no local entry exists → accept unconditionally
If same origin + same timestamp → skip (already have it)
If different origin → conflict:
LWW: higher (ts, node_id) wins
CUSTOMMERGE: call mergefn, assign new timestamp max(localts, remotets) + 1
Appends change to _pending for further propagation (ring topology needs this)
Returns ConflictRecord if conflict occurred

Convergence path

syncuntilconverged() loops sync() up to maxrounds times, checking allconverged() after each round. Convergence means all nodes agree on the same (value, timestamp, is_tombstone) for every key.

Invariants

1. Lamport clock monotonicity. clock never decreases. tick() increments by 1; applyremotechange sets max(local, remote) + 1.

2. Idempotent replication. A change with the same (key, timestamp, nodeid) is applied at most once per node, enforced by seen.

3. Tombstones are permanent until overwritten. A delete sets is_tombstone=True; get() returns None for tombstoned keys. Only a subsequent put can revive a key.

4. Pending changes are consumed exactly once. getpendingchanges() returns the list and replaces it with []. No change is sent twice from the same node (though it may be re-queued by a downstream node in ring topology).

5. Custom merge produces a new timestamp. When CUSTOMMERGE resolves a conflict, it creates newts = max(localts, remotets) + 1 and a canonicalorigin = max(localorigin, remote_node). This ensures the merged result supersedes both inputs in any subsequent LWW comparison.

6. ALLTOALL converges in one sync round (assuming no new writes during sync). Ring requires up to N-1 rounds.

Error Handling

Minimal — appropriate for a reference implementation:

applyremotechange raises ValueError for unknown ConflictStrategy values.
syncuntilconverged raises RuntimeError if convergence isn't reached within max_rounds.
No validation on inputs to put, get, or delete. Keys and values can be anything hashable/storable.
mergefn is called without protection — if it raises, the exception propagates through applyremote_change and sync().

Topics to Explore

[file] multi-leader-replication/testmultileader.py — See how concurrent writes, ring convergence, and custom merge functions are exercised in tests
[file] conflict-free-replicated-data-types/crdts.py — CRDTs are the "conflict-free" alternative to the explicit conflict resolution used here; compare the trade-offs
[file] vector-clocks/vector_clock.py — Vector clocks detect causality more precisely than the scalar Lamport clock used here; understand what this implementation trades away
[function] leader-follower-replication/replication.py:ReplicaNode — Compare single-leader replication (no conflicts by design) with this multi-leader approach
[general] ring-topology-propagation — How pending re-queueing in applyremotechange enables multi-hop propagation, and why idempotency via seen is essential to prevent infinite loops

Beliefs

multi-leader-lww-deterministic-tiebreak — LWW conflict resolution compares (timestamp, node_id) tuples, guaranteeing all nodes independently reach the same winner without coordination
multi-leader-tombstone-delete — Deletes are implemented as tombstone writes that replicate like normal mutations; get() returns None for tombstoned keys
multi-leader-custom-merge-new-timestamp — Custom merge resolution creates a new timestamp (max(localts, remotets) + 1) so the merged result supersedes both conflicting inputs in any future comparison
multi-leader-idempotent-apply — Each (key, timestamp, originnode) triple is applied at most once per node, tracked by the seen set, preventing duplicate application in ring topologies
multi-leader-sync-collects-then-distributes — sync() drains all pending queues before distributing any changes, preventing intra-round cascading where one node's change would trigger further propagation within the same sync round