Topic: Martin Kleppmann's critique of Redis's Redlock algorithm centers on exactly the fencing token mechanism implemented here

Date: 2026-05-29

Time: 13:04

Kleppmann's Redlock Critique and the Fencing Token Implementation

The Core Argument

Kleppmann's central claim against Redlock is simple: any distributed lock that relies on timing assumptions (TTLs, clocks) can fail, and when it does, you need something else to prevent data corruption. That "something else" is a fencing token — and Redlock doesn't provide one.

This codebase implements both sides of the argument as a teaching tool.

How the Code Demonstrates the Problem

The Lock Service Issues Monotonic Tokens

In fencing-tokens/fencingtokens.py:28-40, LockService.acquire() does two things that Redlock does not: it maintains a global counter (self.counter, line 25) and attaches a monotonically increasing token to every lock grant. Each acquisition — whether for a new lock or re-acquisition after expiry — increments the counter (line 38). This is the mechanism Kleppmann argues is *necessary* and Redlock *lacks*.

The Dangerous Scenario: Process Pause After Lock Expiry

The test at fencing-tokens/testfencingtokens.py:108-126 (testunsafescenariostalewrite_corrupts) models exactly Kleppmann's attack scenario:

1. Client A acquires a lock (time=0, TTL=5)

2. Client A writes to the resource

3. Client A pauses (GC pause, network delay, page fault — any real-world delay)

4. The lock expires at time=5

5. Client B acquires the lock (time=6) and writes

6. Client A wakes up, still believing it holds the lock, and overwrites B's data

With UnfencedResourceServer (line 100), the stale write succeeds — data corruption. This is the Redlock failure mode. Redlock gives Client A a lock with a TTL, but once A's process pauses past that TTL, there is *nothing* on the storage side to reject the stale write.

The Fix: Server-Side Token Validation

The companion test at fencing-tokens/testfencingtokens.py:128-145 (testsafescenariostalewriterejected) runs the identical scenario against FencedResourceServer. The critical logic is at fencingtokens.py:80-85:


def write(self, resource, key, value, fencing_token):
    highest = self._highest_token.get(resource, 0)
    if fencing_token < highest:
        return {'success': False, 'error': f'Token {fencing_token} is stale ...'}
    self._highest_token[resource] = fencing_token

Client B's write arrives with token 2; Client A's belated write carries token 1. The server rejects it. The safety guarantee moves from the lock (which is time-dependent and fallible) to the storage server (which uses a monotonic comparison and is not time-dependent).

Why Renewal Matters

LockService.renew() at line 49 deliberately extends the TTL *without incrementing the counter*. This is a subtle design choice: renewal doesn't change the ordering relationship between lock holders. If you issued a new token on renewal, the old holder's token would become artificially "staler" even though no actual ownership change occurred. The test at testfencingtokens.py:68-79 verifies both that the token value is preserved and that the counter stays at 2 (only one acquire happened).

What This Means for Redlock

Kleppmann's argument is that Redlock provides only the lock — the left side of this implementation. It gives you TTL-based mutual exclusion with no fencing token. When the timing assumption fails (and in distributed systems, it *will* fail — GC pauses, clock skew, network delays), there is no server-side mechanism to catch the stale write. The fix requires cooperation from the storage layer (the FencedResourceServer pattern), which means the lock alone was never sufficient for correctness — only for efficiency.

The existence of UnfencedResourceServer as a separate class makes the pedagogical point explicit: the two implementations are identical except for the token check, and that single if fencing_token < highest comparison is the entire difference between data corruption and safety.

Topics to Explore

Beliefs