flush() guarantees and OS write atomicity: is a partial header possible, and what happens if fsync is not called?
Date: 2026-05-29
Time: 07:28
flush() Guarantees and Write Atomicity

The WAL uses a stop-at-first-error recovery strategy. Both recover_seq_num (wal.py:84-95) and the reader function read_record (wal.py:37-58) implement the same pattern:
1. Read the 4-byte length prefix. If fewer than 4 bytes are available, return None (EOF).
2. Read record_length bytes of payload. If fewer bytes are available, return None (partial record — treated as EOF, not an error).
3. Verify the CRC32 checksum. If it fails, raise ValueError.
Callers like recover_seq_num (wal.py:90-95) and replay/iterate catch the ValueError and break out of the loop, discarding everything after the corrupt record, even if valid records follow it. This is a deliberate design choice: once corruption is detected, the log's sequential ordering guarantee is broken, so no subsequent record can be trusted.
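The three reader steps and the caller's break-on-error loop can be sketched as follows. This is a minimal illustration, not the code in wal.py: the `frame` helper, the simplified `[length][CRC][payload]` framing, the big-endian byte order, and the `recover` name are all assumptions made for the example.

```python
import io
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Hypothetical framing: [4B length][4B CRC32][payload], big-endian."""
    return struct.pack(">II", 4 + len(payload), zlib.crc32(payload)) + payload


def read_record(f):
    """Return a payload, None on EOF or a partial record, or raise ValueError."""
    prefix = f.read(4)
    if len(prefix) < 4:
        return None                          # step 1: short length prefix
    (record_length,) = struct.unpack(">I", prefix)
    body = f.read(record_length)
    if len(body) < record_length:
        return None                          # step 2: partial record
    if record_length < 4:
        # Assumption: a length too small to hold a CRC counts as corruption.
        raise ValueError("record too short to hold a CRC")
    (stored_crc,) = struct.unpack(">I", body[:4])
    payload = body[4:]
    if zlib.crc32(payload) != stored_crc:
        raise ValueError("CRC mismatch")     # step 3: checksum failure
    return payload


def recover(f):
    """Collect payloads until EOF or the first corrupt record."""
    records = []
    while True:
        try:
            payload = read_record(f)
        except ValueError:
            break                            # drop the corrupt record and the tail
        if payload is None:
            break                            # clean EOF or truncated tail
        records.append(payload)
    return records


damaged = bytearray(frame(b"second"))
damaged[-1] ^= 0xFF                          # flip bits in the last payload byte
log = frame(b"first") + bytes(damaged) + frame(b"third")
recovered = recover(io.BytesIO(log))         # the valid third record is discarded too
```

In this sketch `recovered` holds only the first payload: the third record is intact on "disk" but is never reached, which is the tail-discarding behavior described above.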
Is a partial header possible?

Yes. The record format written by encode_record (wal.py:29-34) is:
[4B record_length][4B CRC][8B seq_num][1B op_type][4B key_len][key][4B val_len][value]
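To make the layout concrete, here is a hedged sketch of an encoder that produces it. The byte order (big-endian), the meaning of record_length (every byte after the length prefix), and the CRC coverage (op_type, key, and value) are assumptions for illustration; the real encode_record in wal.py may differ on any of them.

```python
import struct
import zlib

# Fixed-size fields: length, CRC, seq_num, op_type, key_len, val_len.
HEADER_OVERHEAD = 4 + 4 + 8 + 1 + 4 + 4


def encode_record(seq_num: int, op_type: int, key: bytes, value: bytes) -> bytes:
    op_byte = struct.pack(">B", op_type)
    # Assumption: the CRC covers op_type, key, and value only.
    crc = zlib.crc32(op_byte + key + value)
    rest = (
        struct.pack(">I", crc)
        + struct.pack(">Q", seq_num)
        + op_byte
        + struct.pack(">I", len(key)) + key
        + struct.pack(">I", len(value)) + value
    )
    # Assumption: record_length counts every byte after the length prefix.
    return struct.pack(">I", len(rest)) + rest


record = encode_record(seq_num=1, op_type=0, key=b"user:1", value=b"alice")

# Every proper prefix of the record is a state a crash could leave on disk.
torn_prefixes = [record[:n] for n in range(1, len(record))]
```

Even this small record is 36 bytes, so there are 35 distinct non-empty torn states, and the first three of them leave an incomplete length prefix.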
This record is a multi-byte structure written with a single self.fd.write(data) call (wal.py:155). POSIX does not make write() to a regular file atomic with respect to crashes; the PIPE_BUF limit (4096 bytes on Linux) governs atomicity only for pipes and FIFOs. Even for small writes, a crash can interrupt the operation at any byte boundary because:
1. Python's file.write() goes through a userspace buffer (io.BufferedWriter by default, not the C library's stdio), which may issue multiple write(2) syscalls.
2. A single write(2) syscall to a regular file carries no crash-atomicity guarantee: the kernel writes dirty pages back to disk on its own schedule, so part of a record can reach the device without the rest.
3. Without fsync, the OS page cache is the only buffer. A power failure can lose any portion of a write that hasn't been flushed to the storage device, leaving a partial header on disk.
The read_record function handles this: if fewer than 4 bytes are available for the length prefix (wal.py:40-41), or if the payload is shorter than record_length (wal.py:43-44), it returns None. This treats a partial header as a clean EOF: the record is simply lost, and recovery stops there.
However, there is a subtler scenario: a torn or corrupted write that leaves a plausible but wrong value in the length field. If the wrong length still fits within the bytes available in the file, the reader reads record_length bytes of noise, computes a CRC, and almost certainly gets a mismatch, triggering the ValueError at wal.py:55-56. If the wrong length points past the end of the file, the short-read check returns None instead. The CRC acts as the safety net here.
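The outcomes described above can be checked with a small simulation. This is a sketch under assumptions: `frame` and `classify` are hypothetical helpers, and the framing is simplified to `[4B length][4B CRC32][payload]` in big-endian order rather than the full record layout.

```python
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Hypothetical framing: [4B length][4B CRC32][payload]."""
    return struct.pack(">II", 4 + len(payload), zlib.crc32(payload)) + payload


def classify(buf: bytes) -> str:
    """Mimic the reader's outcomes for the first record in buf."""
    if len(buf) < 4:
        return "eof"                         # short length prefix
    (record_length,) = struct.unpack(">I", buf[:4])
    body = buf[4:4 + record_length]
    if len(body) < record_length:
        return "eof"                         # short payload
    if record_length < 4:
        return "corrupt"                     # assumption: no room for a CRC
    (stored_crc,) = struct.unpack(">I", body[:4])
    return "ok" if zlib.crc32(body[4:]) == stored_crc else "corrupt"


record = frame(b"put k=v")

# A write torn at any byte boundary looks like a clean end-of-file.
torn_outcomes = {classify(record[:n]) for n in range(len(record))}

# A length prefix that points past the available bytes also reads as EOF.
too_long = struct.pack(">I", 1000) + record[4:]

# A damaged payload byte under an intact length is caught by the CRC.
flipped = bytearray(record)
flipped[-1] ^= 0x01
```

Here every torn prefix classifies as "eof", `too_long` also classifies as "eof", and `flipped` classifies as "corrupt". A length that is wrong but still within bounds would make the reader checksum the wrong span of bytes, which is the case where the CRC is the only line of defense.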
What happens if fsync is not called?

The do_sync method (wal.py:123-134) reveals three sync modes:
| Mode | Behavior | Durability |
|------|----------|------------|
| "sync" | flush() + fsync() on every append | Durable after append() returns |
| "batch" | flush() + fsync() every N writes | Last N-1 writes may be lost |
| "none" | Neither flush() nor fsync() | No durability guarantee at all |
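The three modes in the table can be sketched as a small writer class. `SyncingWriter`, its counters, the `batch_size` parameter, and `count_fsyncs` are hypothetical names introduced for this example; only the flush-then-fsync ordering and the `force` flag are taken from the description of do_sync.

```python
import os
import tempfile


class SyncingWriter:
    """Hypothetical writer illustrating the "sync", "batch", and "none" modes."""

    def __init__(self, path, mode="sync", batch_size=3):
        self.f = open(path, "ab")
        self.mode = mode
        self.batch_size = batch_size
        self.unsynced = 0                    # writes since the last fsync
        self.fsync_calls = 0

    def append(self, data: bytes, force: bool = False) -> None:
        self.f.write(data)
        self.unsynced += 1
        self.do_sync(force)

    def do_sync(self, force: bool = False) -> None:
        due = (
            force
            or self.mode == "sync"
            or (self.mode == "batch" and self.unsynced >= self.batch_size)
        )
        if due:
            self.f.flush()                   # Python buffer -> kernel page cache
            os.fsync(self.f.fileno())        # page cache -> storage device
            self.unsynced = 0
            self.fsync_calls += 1
        # Mode "none" falls through: neither flush() nor fsync() runs.

    def close(self) -> None:
        self.f.close()


def count_fsyncs(mode: str, writes: int) -> int:
    """Append `writes` records in a temp file and report how many fsyncs ran."""
    with tempfile.TemporaryDirectory() as tmp:
        w = SyncingWriter(os.path.join(tmp, "wal.log"), mode=mode)
        for _ in range(writes):
            w.append(b"record")
        w.close()
        return w.fsync_calls
```

With seven appends and a batch size of three, this sketch issues seven fsyncs in "sync" mode, two in "batch" mode (leaving one write unsynced), and none in "none" mode.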
The "none" mode (wal.py:123-134 — note the implicit else: do nothing) is the dangerous case:
1. flush() alone pushes data from Python's userspace buffer to the OS kernel page cache. Without it, data sits in the Python process's memory — a SIGKILL loses it entirely, leaving the WAL file unchanged.
2. Without fsync(), even flushed data sits only in the OS page cache. A power failure can lose it, leaving the file on disk with none of the record or only part of it.
In modes "batch" and "none", the break-on-corruption strategy becomes the only defense against partial writes. The CRC check will catch garbage, and the short-read checks will catch truncation. But the consequence is silent data loss: records the application believed were committed are simply absent after recovery. The caller has no way to distinguish "this record was never written" from "this record was lost in a crash."
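The gap between Python's userspace buffer and the kernel page cache can be observed directly. The sketch below assumes a file opened with Python's default buffering and a write much smaller than the buffer; the file name and variable names are illustrative only.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wal.log")
    f = open(path, "wb")                     # default buffering: io.BufferedWriter
    f.write(b"x" * 25)                       # small write stays in the Python buffer
    size_before_flush = os.path.getsize(path)

    f.flush()                                # Python buffer -> kernel page cache
    size_after_flush = os.path.getsize(path)

    os.fsync(f.fileno())                     # page cache -> storage device
    f.close()
```

Before flush() the file is still empty as far as the OS is concerned, so a killed process would leave nothing behind; after flush() the 25 bytes are visible to other processes. The effect of os.fsync() cannot be observed this way, because the file size is reported from the page cache either way: only a power failure distinguishes the two states.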
Notably, append_batch (wal.py:159-170) and checkpoint (wal.py:172-178) force a sync regardless of mode (force=True), which means batch boundaries and checkpoint markers are always durable. Individual append() calls respect the configured mode, so in "batch" or "none" mode individual operations can be lost, while a batch COMMIT record is always synced. Note that fsync() persists all data previously written to the file, not just the latest record, so a forced sync also makes every earlier flushed record durable; the records actually at risk are the individual appends issued after the most recent forced sync. This creates an asymmetry: you could lose individual puts but never a commit marker.
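The WAL's corruption strategy is defensive but incomplete: the CRC and short-read checks keep damaged records from being replayed, but valid records after a corrupt one are discarded, a truncated tail is indistinguishable from a normal end-of-file, and the header fields are not covered by the CRC.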
The design faithfully mirrors what DDIA describes: the WAL is only as durable as the sync policy allows, and the CRC + break strategy ensures that whatever *is* on disk is either fully valid or cleanly discarded.
- write-ahead-log/wal.py:append_batch: Why batches force sync and how the COMMIT record creates an all-or-nothing recovery boundary
- write-ahead-log/wal.py:read_record: The exact byte-level parsing logic and how each failure mode (short read, CRC mismatch) maps to a different recovery outcome
- b-tree-storage-engine/btree.py: Compare the WAL journal's checksum strategy (line 132-135) with how the B-tree handles write-ahead logging for page-level crash recovery
- write-ordering-and-barriers: How the OS can reorder writes within a single write() call and why fdatasync vs fsync matters for metadata vs data durability
- hash-index-storage/bitcask.py:writerecord: Bitcask uses the same flush() + conditional fsync() pattern (line 91-93) but has no CRC; compare the failure modes

- wal-break-on-corruption-discards-tail: When read_record raises ValueError (CRC mismatch), all callers break immediately, discarding any valid records that follow the corrupt one in the same file
- wal-partial-header-treated-as-eof: A short read of fewer than 4 bytes for the length prefix or fewer than record_length bytes for the payload returns None, indistinguishable from a normal end-of-file
- wal-batch-commit-always-synced: append_batch calls do_sync(force=True), ensuring the COMMIT record is fsynced regardless of the configured sync mode
- wal-none-mode-no-flush-no-fsync: In sync mode "none", do_sync is a no-op: data may remain in Python's userspace buffer and never reach disk before a crash
- wal-crc-covers-content-not-header: The CRC32 is computed over optypebyte + key + value only; the length prefix, sequence number, and key length fields are not checksummed, meaning corruption in those fields is detected indirectly via downstream parse failures