File: write-ahead-log/wal.py

Date: 2026-05-28

Time: 18:07

write-ahead-log/wal.py — Write-Ahead Log for Crash Recovery

Purpose

This file implements a Write-Ahead Log (WAL), the foundational durability primitive described in DDIA Chapter 3. A WAL ensures that every mutation is written to a sequential, append-only log on disk *before* it's applied to the main data structure. If the process crashes, the log can be replayed to reconstruct state that was acknowledged but not yet persisted.

This implementation owns: sequential record encoding/decoding, fsync-based durability guarantees, log file rotation, truncation after checkpointing, and crash-safe replay.

Key Components

Constants & Mappings

OPPUT, OPDELETE, OPCOMMIT, OPCHECKPOINT are the four operation types, stored as single-byte integers in the binary format. OPNAMES and OPBYTES provide bidirectional lookup between the integer wire format and string names used in the public API.

WALRecord (dataclass)

The deserialized form of a single log entry. Carries seqnum (monotonic ordering), optype (string name), key, value, and checksum. This is the unit of data returned by replay() and iterate().

encoderecord(seqnum, optype_byte, key, value) -> bytes

Serializes a record into a length-prefixed binary frame:


[4B record_length][4B CRC32][8B seq_num][1B op_type][4B key_len][key][4B val_len][value]

The total frame on disk is 4 (length prefix) + recordlength bytes. The CRC covers optype + key + value — notably it does not cover seq_num or lengths, so a corrupted sequence number would not be detected by the checksum.

readrecord(f) -> Optional[WALRecord]

Reads one record from a file handle. Returns None on EOF or partial read (torn write), raises ValueError on CRC mismatch. This distinction is critical: a short read means "the process crashed mid-write, ignore this tail" while a CRC failure means "data corruption in the middle of the log."

WriteAheadLog (class)

The main class. Constructor parameters:

| Parameter | Default | Purpose |

|-----------|---------|---------|

| log_dir | required | Directory for .wal files |

| sync_mode | "sync" | "sync" = fsync every write; "batch" = fsync every N writes; "none" = never fsync |

| maxfilesize | 10 MB | Rotation threshold |

| batchsynccount | 100 | Writes between fsyncs in batch mode |

Core methods:

Patterns

Length-prefixed framing. Each record starts with a 4-byte length, so the reader can skip or validate entire records without parsing internals. This is standard in log-structured storage (Kafka, LevelDB WAL, etc.).

CRC integrity checking. Every record carries a CRC32 computed over the payload. This detects bit rot and torn writes where the length was written but the body is garbage.

Monotonic sequence numbers. seqnum increments under the lock and is never reused. On recovery, recoverseq_num scans all files to find the high-water mark, so new records continue the sequence without gaps or collisions.

File rotation. When a WAL file exceeds maxfilesize, it's fsynced, closed, and a new file with an incremented numeric name is created. This bounds individual file sizes and enables efficient truncation (delete whole files rather than rewriting).

Coarse-grained locking. A single threading.Lock serializes all writes. This is simple and correct — WAL writes are inherently sequential on disk, so fine-grained locking would add complexity without improving throughput.

Sync mode trade-off. The three sync modes (sync, batch, none) let callers choose their position on the durability-vs-performance spectrum. append_batch always force-fsyncs because batch atomicity requires it — a crash between writing the operations and the COMMIT marker would leave an uncommitted batch.

Dependencies

Imports: os (file I/O, fsync), struct (binary packing), zlib (CRC32), threading (lock), plus standard library typing/dataclass support.

Imported by: testwal.py and testertest_wal.py — this is a leaf module with no downstream production consumers in this repo. It exists as a standalone reference implementation.

Flow

Write path

1. Caller invokes append() or append_batch()

2. Lock acquired

3. Sequence number(s) incremented

4. Record(s) encoded to binary via encoderecord

5. Bytes written to the open file descriptor

6. dosync() decides whether to fsync based on mode

7. mayberotate() checks file size and rotates if needed

8. Lock released, sequence number returned

Recovery path (constructor)

1. os.makedirs ensures the directory exists

2. recoverseqnum() scans every .wal file, reads all records, and finds max(seqnum) — this is O(total WAL size)

3. openlatest() opens the most recent file for appending if it's under the size limit, otherwise rotates

Replay path

1. _fd is flushed (but not fsynced — just ensuring buffered writes are visible)

2. All WAL files are scanned in sorted order via readall_records()

3. Records with seqnum <= afterseq are skipped

4. Only PUT/DELETE records are returned; COMMIT/CHECKPOINT are filtered

Truncation path

1. Current file is flushed, fsynced, and closed

2. Each WAL file is scanned; records with seqnum > upto_seq are kept

3. Empty files are deleted; files with remaining records are rewritten

4. openlatest() re-opens for future writes

Invariants

Error Handling

Topics to Explore

Beliefs