Topic: How GC pauses and network delays create the exact window fencing tokens protect against (DDIA §8.3)

Date: 2026-05-29

Time: 13:03

How GC Pauses and Network Delays Create the Window Fencing Tokens Protect Against

The Dangerous Window

DDIA §8.3 describes a specific failure scenario that this codebase models directly. The key insight: a lock's TTL can expire while the lock holder is still alive but unresponsive — paused by GC, delayed by the network, or simply slow. The holder wakes up believing it still holds the lock and writes to shared state, corrupting data that the new holder is already modifying.

Here's how the code makes this concrete.

The Timeline, Mapped to Code

Step 1: Client A acquires the lock and gets token 1.

In LockService.acquire (fencing_tokens.py:30-37), the service issues a FencingToken with a monotonically increasing counter. Client A receives token 1 and the counter advances to 2.

Step 2: Client A enters a GC pause (or hits a network delay).

This is the critical moment the code *cannot model explicitly*, and that's the point. Nothing in the Client class (fencing_tokens.py:118-153) tracks whether the client is actually responsive. The client holds a FencingToken object in self.held_tokens (line 130) and believes it's still valid. There is no heartbeat, no liveness check, no way for the client to know it was paused. The FencingToken.is_expired check (line 17) requires passing current_time, but a GC-paused process cannot observe that time has passed until it resumes, and even a check made immediately before a write can be invalidated by a pause between the check and the write.
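The check-then-write gap can be sketched as follows. This is a minimal illustration, not code from the repository: Lease, FakeClock, and guarded_write are hypothetical names, and the pause is simulated by advancing a manual clock between the expiry check and the write.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    """Hypothetical stand-in for a held lock; not the repo's FencingToken."""
    issued_at: float
    ttl: float

    def is_expired(self, current_time: float) -> bool:
        return current_time >= self.issued_at + self.ttl


class FakeClock:
    """Manually advanced clock so the simulated pause is deterministic."""

    def __init__(self) -> None:
        self.now = 0.0

    def advance(self, seconds: float) -> None:
        self.now += seconds


def guarded_write(lease: Lease, clock: FakeClock, pause: float) -> dict:
    # 1. The client checks its lease and refuses to write if it has expired.
    if lease.is_expired(clock.now):
        return {"wrote": False, "lease_valid_at_write": False}
    # 2. The process stalls here (GC, network, scheduler); the client cannot see this.
    clock.advance(pause)
    # 3. The write goes out on the strength of the earlier check.
    return {"wrote": True, "lease_valid_at_write": not lease.is_expired(clock.now)}


clock = FakeClock()
clock.advance(4)                               # t = 4, lease valid until t = 5
lease = Lease(issued_at=0, ttl=5)
result = guarded_write(lease, clock, pause=2)  # the pause carries the clock to t = 6
print(result)
```

The check passed at t = 4, yet the write happened at t = 6 with an expired lease. No amount of client-side checking removes this gap, which is why the rejection has to happen on the resource side.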

Step 3: The lock expires. Client B acquires the same lock and gets token 2.

LockService.acquire (lines 32-33) checks existing.is_expired(current_time). Since current_time >= issued_at + ttl (6 >= 0 + 5 in the tests), the lock is considered expired, and Client B gets a *new, higher* token. The counter is now 3.
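Steps 1 and 3 can be sketched together. This is a simplified stand-in written for illustration, assuming a single service-wide counter and a None return when the lock is still held; Token and SketchLockService are hypothetical names, and the real LockService.acquire may differ in its return type and bookkeeping.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Token:
    """Hypothetical token shape, loosely modeled on the description above."""
    lock_name: str
    value: int
    issued_at: float
    ttl: float

    def is_expired(self, current_time: float) -> bool:
        return current_time >= self.issued_at + self.ttl


class SketchLockService:
    def __init__(self) -> None:
        self._next_token = 1                 # monotonically increasing counter
        self._held: Dict[str, Token] = {}

    def acquire(self, lock_name: str, current_time: float, ttl: float) -> Optional[Token]:
        existing = self._held.get(lock_name)
        if existing is not None and not existing.is_expired(current_time):
            return None                      # still held by someone else
        # Either the lock is free or the previous lease has expired: issue a new token.
        token = Token(lock_name, self._next_token, current_time, ttl)
        self._next_token += 1
        self._held[lock_name] = token
        return token


service = SketchLockService()
token_a = service.acquire("lock", current_time=0, ttl=5)  # Client A
blocked = service.acquire("lock", current_time=3, ttl=5)  # too early, A's lease is live
token_b = service.acquire("lock", current_time=6, ttl=5)  # A's lease expired at t = 5
print(token_a.value, blocked, token_b.value)
```

Note that nothing in acquire notifies Client A that its lease was replaced. Client A's token_a object is unchanged and still looks usable from A's side.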

Step 4: Client A wakes up and writes with its stale token.

This is where the two paths diverge — and where the tests make the argument explicit.

The Unsafe Path: test_unsafe_scenario_stale_write_corrupts (test_fencing_tokens.py:107-121)


c_a.acquire_lock("lock", current_time=0, ttl=5)
unfenced.write("shared", "value", "written-by-A")
# Lock expires, B acquires
c_b.acquire_lock("lock", current_time=6, ttl=5)
unfenced.write("shared", "value", "written-by-B")
# A wakes up — stale write succeeds!
unfenced.write("shared", "value", "stale-write-by-A")

UnfencedResourceServer.write (fencing_tokens.py:104-108) accepts every write unconditionally. Client A's stale write overwrites Client B's valid data. The resource server has no way to distinguish a current holder from a zombie.
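A minimal sketch of such a server, assuming a plain dictionary keyed by resource and key. SketchUnfencedStore is a hypothetical name and not the repository's UnfencedResourceServer.

```python
from typing import Any, Dict, Tuple


class SketchUnfencedStore:
    def __init__(self) -> None:
        self.data: Dict[Tuple[str, str], Any] = {}

    def write(self, resource: str, key: str, value: Any) -> dict:
        # No token argument and no ordering check: the last write always wins.
        self.data[(resource, key)] = value
        return {"success": True}


store = SketchUnfencedStore()
store.write("shared", "value", "written-by-A")
store.write("shared", "value", "written-by-B")
late = store.write("shared", "value", "stale-write-by-A")  # arrives after B's write
print(late, store.data[("shared", "value")])
```

The write signature carries no information about who holds the lock, so the server has nothing to base a rejection on.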

The Safe Path: test_safe_scenario_stale_write_rejected (test_fencing_tokens.py:123-140)


c_a.acquire_lock("lock", current_time=0, ttl=5)
c_a.write_to_resource(fenced, "shared", "value", "written-by-A", "lock")  # token=1
# Lock expires, B acquires
c_b.acquire_lock("lock", current_time=6, ttl=5)
c_b.write_to_resource(fenced, "shared", "value", "written-by-B", "lock")  # token=2
# A tries stale write — rejected!
result = c_a.write_to_resource(fenced, "shared", "value", "stale-write-by-A", "lock")
assert result['success'] is False

FencedResourceServer.write (fencing_tokens.py:83-92) tracks the highest token seen per resource (self.highest_token). When Client A writes with token 1 after the server has already seen token 2, the comparison at line 87 (fencing_token < highest) catches it: 1 < 2, so the write is rejected.
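The comparison can be sketched as below. This is an illustrative reconstruction from the description above, not the repository's implementation: SketchFencedStore, the "reason" field, and the default of 0 for an unseen resource are assumptions.

```python
from typing import Any, Dict, Tuple


class SketchFencedStore:
    def __init__(self) -> None:
        self.data: Dict[Tuple[str, str], Any] = {}
        self.highest_token: Dict[str, int] = {}   # highest token seen, per resource

    def write(self, resource: str, key: str, value: Any, fencing_token: int) -> dict:
        highest = self.highest_token.get(resource, 0)
        if fencing_token < highest:
            # A newer lock holder has already written here: reject the stale writer.
            return {"success": False, "reason": f"stale token {fencing_token} < {highest}"}
        self.highest_token[resource] = fencing_token
        self.data[(resource, key)] = value
        return {"success": True}


store = SketchFencedStore()
first = store.write("shared", "value", "written-by-A", fencing_token=1)
second = store.write("shared", "value", "written-by-B", fencing_token=2)
stale = store.write("shared", "value", "stale-write-by-A", fencing_token=1)
print(first["success"], second["success"], stale["success"])
```

The strict less-than means a holder can write several times with the same token; only tokens older than the highest one seen are turned away.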

Why the Lock Alone Is Insufficient

The crucial design decision is that the lock service and the resource server are separate systems with no shared state. The lock service knows the lock expired. The resource server does not. This separation mirrors real distributed systems where the lock manager (e.g., ZooKeeper) and the storage system (e.g., a database) are independent services.

The fencing token bridges this gap by encoding the order of lock grants *into the write request itself*. The resource server doesn't need to query the lock service; it just compares integers. This is why highest_token is tracked per resource (line 81) and why test_independent_resource_token_tracking (test_fencing_tokens.py:95-101) verifies that different resources maintain independent highest-token values.
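A small sketch of per-resource tracking, assuming the decision depends only on the token in the request and the server's own local state. The accept function and the resource names "orders" and "invoices" are hypothetical and chosen for illustration.

```python
from typing import Dict


def accept(highest_seen: Dict[str, int], resource: str, fencing_token: int) -> bool:
    """Decide from the request's token and local state only; no lock-service call."""
    if fencing_token < highest_seen.get(resource, 0):
        return False
    highest_seen[resource] = fencing_token
    return True


highest_seen: Dict[str, int] = {}
results = [
    accept(highest_seen, "orders", 5),    # first write to "orders"
    accept(highest_seen, "invoices", 1),  # lower token, but a different resource
    accept(highest_seen, "orders", 4),    # older than 5 on the same resource
]
print(results, highest_seen)
```

If the server kept one global maximum instead, the write to "invoices" with token 1 would be rejected merely because an unrelated resource had already seen token 5.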

What the Code Doesn't Show

The grep results confirm that the codebase does not model GC pauses or network delays explicitly: there are zero matches for gc_pause, sleep, delay, frozen, network_delay, or latency. The test simulates the *effect* (time jumps from 0 to 6, skipping over the TTL boundary at 5) rather than the *cause*. This is actually a strength of the design: the fencing token mechanism works regardless of *why* the client was delayed. Whether the cause is a GC pause, a network partition, CPU starvation, or an operator accidentally suspending the process, the protection is the same.

Topics to Explore

Beliefs