Topic: The WAL has both OP_COMMIT (for batch atomicity) and OP_CHECKPOINT (for truncation safety); understanding when each is used clarifies the two distinct durability guarantees

Date: 2026-05-29

Time: 08:05

Two Durability Guarantees: Commit vs. Checkpoint

This WAL implementation provides two distinct safety markers that serve different purposes. They're easy to confuse because both are "just records in the log," but they protect against different failure modes.

OP_COMMIT — Batch Atomicity

OP_COMMIT answers the question: "Did all operations in this group make it to the log?"

It's written exclusively by append_batch() (wal.py:155–167). The method buffers all individual operations plus a trailing COMMIT record, then writes the entire buffer in a single self._fd.write(bytes(buf)) call followed by a forced fsync:


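# wal.py:155-167
def append_batch(self, operations: List[Tuple[str, str, str]]) -> int:
    with self._lock:
        buf = bytearray()
        # Encode every operation into one in-memory buffer first.
        for op_type, key, value in operations:
            self._seq_num += 1
            buf.extend(_encode_record(...))
        # Seal the batch with a COMMIT record that gets its own sequence number.
        self._seq_num += 1
        commit_seq = self._seq_num
        buf.extend(_encode_record(commit_seq, OP_COMMIT, b"", b""))
        # One write call for the whole batch, then a forced fsync.
        self._fd.write(bytes(buf))
        self._do_sync(force=True)
        return commit_seq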

During replay, if the recovery code sees a PUT, PUT, DELETE sequence with no trailing COMMIT, it knows the batch was interrupted: those records are incomplete and should be discarded. The COMMIT record is the "all-or-nothing" boundary. The test at test_wal.py:13–19 confirms this: a 3-operation batch produces a COMMIT at seq 7, and iterate() (test_wal.py:88–92) shows the COMMIT is present as a real record in the log.
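The discard rule above can be sketched as a small standalone function. This is an illustration only, not the recovery code in wal.py: the Record tuple and committed_ops() are hypothetical names, and the sketch assumes every data record belongs to a batch. A log that mixes single append() records with batches would need some additional way to tell the two apart, which is not shown here.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    """Hypothetical decoded log record (not the real wal.py type)."""
    seq: int
    op_type: str  # "PUT", "DELETE", or "COMMIT"
    key: str
    value: str


def committed_ops(records: List[Record]) -> List[Record]:
    """Return only the data operations that are sealed by a COMMIT.

    Simplifying assumption: every data record is part of a batch.
    """
    applied: List[Record] = []
    pending: List[Record] = []
    for rec in records:
        if rec.op_type == "COMMIT":
            # The batch is complete: promote its operations.
            applied.extend(pending)
            pending = []
        else:
            pending.append(rec)
    # Whatever is still pending never reached its COMMIT, so it is dropped.
    return applied


log = [
    Record(1, "PUT", "a", "1"),
    Record(2, "DELETE", "b", ""),
    Record(3, "COMMIT", "", ""),
    Record(4, "PUT", "c", "3"),  # interrupted batch: no COMMIT follows
    Record(5, "PUT", "d", "4"),
]
survivors = committed_ops(log)
```

Here only the records at seq 1 and 2 survive; the trailing two PUTs are treated as an interrupted batch.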

Compare this to append() (wal.py:147–154), which writes individual records with no COMMIT. Single operations are their own atomic unit — they either made it to disk or they didn't. The COMMIT marker only matters when multiple operations must succeed or fail together.

OP_CHECKPOINT — Truncation Safety

OP_CHECKPOINT answers a different question: "Up to what point has the main data store absorbed these changes?"

It's written by checkpoint() (wal.py:169–176) as a standalone record, also with a forced fsync:


# wal.py:169-176
def checkpoint(self) -> int:
    with self._lock:
        self._seq_num += 1
        seq = self._seq_num
        self._fd.write(_encode_record(seq, OP_CHECKPOINT, b"", b""))
        self._do_sync(force=True)
        return seq

The checkpoint sequence number becomes the argument to truncate() (wal.py:178+), which removes all records at or below that sequence. The test at test_wal.py:19–25 demonstrates the intended protocol:


cp_seq = wal.checkpoint()       # "main store is consistent up to here"
records = wal.replay(after_seq=cp_seq)
assert len(records) == 0        # nothing left to replay

The caller is saying: "I've flushed everything up to this point to the main store. If I crash after this, I only need to replay records *after* the checkpoint." This is what makes truncate(cp_seq) safe — you're only discarding WAL entries that the main store already reflects.
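To make the ordering concrete, here is a minimal in-memory sketch of the flush, checkpoint, truncate sequence. ToyWAL and apply_to_store() are hypothetical stand-ins written for this note; they mimic the method names described above but do no file I/O or fsync, and they are not the real wal.py class.

```python
from typing import Dict, List, Tuple

Rec = Tuple[int, str, str, str]  # (seq, op_type, key, value)


class ToyWAL:
    """In-memory stand-in for the WAL, for illustration only."""

    def __init__(self) -> None:
        self.seq = 0
        self.records: List[Rec] = []

    def append(self, op_type: str, key: str, value: str) -> int:
        self.seq += 1
        self.records.append((self.seq, op_type, key, value))
        return self.seq

    def checkpoint(self) -> int:
        # A checkpoint is just another record with its own sequence number.
        return self.append("CHECKPOINT", "", "")

    def truncate(self, up_to_seq: int) -> None:
        # Drop every record at or below the given sequence number.
        self.records = [r for r in self.records if r[0] > up_to_seq]

    def replay(self, after_seq: int = 0) -> List[Rec]:
        # Only data operations newer than after_seq are returned.
        return [r for r in self.records
                if r[0] > after_seq and r[1] in ("PUT", "DELETE")]


def apply_to_store(store: Dict[str, str], records: List[Rec]) -> None:
    """Apply data operations to a dict standing in for the main store."""
    for _seq, op_type, key, value in records:
        if op_type == "PUT":
            store[key] = value
        elif op_type == "DELETE":
            store.pop(key, None)


wal = ToyWAL()
store: Dict[str, str] = {}
wal.append("PUT", "a", "1")
wal.append("PUT", "b", "2")
wal.append("DELETE", "a", "")

apply_to_store(store, wal.replay())  # 1. main store absorbs the changes
cp_seq = wal.checkpoint()            # 2. record "consistent up to here"
wal.truncate(cp_seq)                 # 3. now safe to drop older entries
wal.append("PUT", "c", "3")          # later writes land after the checkpoint
```

Reversing steps 1 and 3 is the failure this protocol guards against: truncating before the store has absorbed the records would lose them.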

The Two Guarantees Together

| Marker | Protects against | Written by | Used during |
|--------|-----------------|------------|-------------|
| OP_COMMIT | Partial batch application | append_batch() | Recovery/replay |
| OP_CHECKPOINT | Premature WAL truncation | checkpoint() | Truncation |

A typical lifecycle: write batches (each sealed with COMMIT) → flush to main store → write CHECKPOINT → truncate WAL up to checkpoint. COMMIT ensures each batch is atomic within the log. CHECKPOINT ensures you don't delete log entries the main store hasn't absorbed yet.

Note that replay() filters out both COMMIT and CHECKPOINT records (returning only PUT/DELETE data operations), while iterate() returns everything including markers. The test at test_wal.py:88–92 asserts all_recs[3].op_type == "COMMIT" to verify this.
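The difference between the two read paths can be shown with a short sketch. The functions below are hypothetical free-standing versions that operate on a plain list; the real methods in wal.py read from the log file, and their exact signatures are not reproduced here.

```python
from typing import Iterator, List, NamedTuple


class Record(NamedTuple):
    """Hypothetical decoded log record."""
    seq: int
    op_type: str
    key: str
    value: str


MARKERS = {"COMMIT", "CHECKPOINT"}


def iterate(log: List[Record]) -> Iterator[Record]:
    """Yield every record, markers included."""
    yield from log


def replay(log: List[Record], after_seq: int = 0) -> List[Record]:
    """Return only data operations newer than after_seq."""
    return [r for r in log
            if r.seq > after_seq and r.op_type not in MARKERS]


log = [
    Record(1, "PUT", "a", "1"),
    Record(2, "PUT", "b", "2"),
    Record(3, "DELETE", "a", ""),
    Record(4, "COMMIT", "", ""),
    Record(5, "CHECKPOINT", "", ""),
]
all_recs = list(iterate(log))  # five records, markers visible
data_recs = replay(log)        # three records, markers filtered out
```

In this toy log the COMMIT sits at index 3 of the iterate() output, while replay() never surfaces it.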

Topics to Explore

Beliefs