readrecord (line 57-59) interacts with OS write atomicity to detect torn writes
Date: 2026-05-29
Time: 06:33
readrecord (write-ahead-log/wal.py:37) uses two distinct mechanisms to handle incomplete writes, and they catch different failure modes:
Layer 1 — Length-prefix framing (lines 39–44): Every record is preceded by a 4-byte recordlength. On read, if fewer than 4 bytes are available (line 41) or the body is shorter than recordlength (line 44), the function returns None silently. This catches the common torn-write case: a crash interrupts write() mid-record, so the file simply ends with a truncated record. The reader treats this as EOF — no error, just "this record didn't land."
Layer 2 - CRC32 verification (lines 53–56): If the length prefix *and* the full body are present but the data is wrong, the CRC catches it. The check recomputes zlib.crc32 over op_type_byte + key + value and compares the result against the CRC stored at the front of the record. A mismatch raises ValueError; this is treated as corruption, not a benign truncation.
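Both layers can be sketched together as follows. This is a minimal illustration, not the code in wal.py: the little-endian layout `[length:4][crc:4][seq_num:8][op_type:1][key_len:4][val_len:4][key][value]` and the names `encode_record_sketch` and `read_record_sketch` are assumptions chosen to match the description above.

```python
import io
import struct
import zlib

# Assumed header that follows the 4-byte length prefix (not taken from wal.py):
# crc32, seq_num, op_type, key_len, val_len
HEADER = struct.Struct("<IQBII")


def encode_record_sketch(seq_num, op_type, key, value):
    """Frame one record: length prefix, then header, key, and value."""
    crc = zlib.crc32(struct.pack("B", op_type) + key + value) & 0xFFFFFFFF
    body = HEADER.pack(crc, seq_num, op_type, len(key), len(value)) + key + value
    return struct.pack("<I", len(body)) + body


def read_record_sketch(stream):
    """Return a record tuple, None on truncation, or raise ValueError on corruption."""
    # Layer 1: length-prefix framing. A short read means the write never finished.
    prefix = stream.read(4)
    if len(prefix) < 4:
        return None
    (record_length,) = struct.unpack("<I", prefix)
    body = stream.read(record_length)
    if len(body) < record_length:
        return None

    # Layer 2: CRC32 over op_type + key + value.
    if record_length < HEADER.size:
        raise ValueError("record shorter than its header")
    crc, seq_num, op_type, key_len, val_len = HEADER.unpack_from(body)
    key = body[HEADER.size:HEADER.size + key_len]
    value = body[HEADER.size + key_len:HEADER.size + key_len + val_len]
    actual = zlib.crc32(struct.pack("B", op_type) + key + value) & 0xFFFFFFFF
    if actual != crc:
        raise ValueError("CRC mismatch")
    return seq_num, op_type, key, value


record = encode_record_sketch(7, 1, b"k", b"v")
print(read_record_sketch(io.BytesIO(record)))        # (7, 1, b'k', b'v')
print(read_record_sketch(io.BytesIO(record[:-1])))   # None
```

Dropping the last byte triggers layer 1, while altering a byte without changing the length falls through to layer 2.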
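POSIX write() is not atomic for arbitrarily sized buffers. The OS and disk can commit data in sector-sized chunks (typically 512 B or 4 KB). A crash during a multi-sector write can therefore leave a record whose length prefix and leading sectors reached disk while its trailing sectors still hold stale bytes.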
In this scenario, layer 1 sees a complete record (the length and body sizes match), so it doesn't trigger. But layer 2 catches it — the CRC was computed from the intended payload in encoderecord (line 31), and the stale tail bytes produce a different checksum.
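The scenario can be reproduced in memory. The sketch below assumes a simplified frame (`[length:4][crc:4][payload]`), a 512-byte sector, and a region that already held stale bytes before the write (for example a preallocated or recycled log file). None of these details are taken from wal.py, and the helper names are hypothetical.

```python
import struct
import zlib

SECTOR = 512  # assumed sector size


def frame(payload):
    """Build [length][crc][payload]; length counts the crc plus the payload."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return struct.pack("<II", len(payload) + 4, crc) + payload


def torn_write(disk, offset, data, sectors_committed):
    """Copy only the first `sectors_committed` sectors of `data` onto `disk`."""
    committed = data[:sectors_committed * SECTOR]
    disk[offset:offset + len(committed)] = committed


def check(disk, offset=0):
    """Apply the length check first, then the CRC check."""
    (length,) = struct.unpack_from("<I", disk, offset)
    body = bytes(disk[offset + 4:offset + 4 + length])
    if len(body) < length:
        return "truncated"
    (crc,) = struct.unpack_from("<I", body)
    if zlib.crc32(body[4:]) & 0xFFFFFFFF != crc:
        return "crc mismatch"
    return "ok"


record = frame(b"A" * 1500)            # 1508 bytes, so it spans three sectors
disk = bytearray(b"\xEE" * 2048)       # stale bytes from an earlier use

torn_write(disk, 0, record, sectors_committed=2)   # crash before the third sector
print(check(disk))                     # crc mismatch

torn_write(disk, 0, record, sectors_committed=3)   # the full write lands
print(check(disk))                     # ok
```

Because the stale tail fills out the expected length, the length check alone cannot tell the torn record from a good one.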
Looking at encoderecord (lines 30–31):
crc_data = struct.pack("B", op_type_byte) + key + value
crc = zlib.crc32(crc_data) & 0xFFFFFFFF
The CRC covers optype, key, and value, which together form the semantic payload. It does not cover seq_num, keylen, or vallen. This is a pragmatic choice: if keylen or vallen is corrupted by a torn write, the reader slices the wrong bytes for key/value, which overwhelmingly produces a CRC mismatch anyway. However, a torn write that corrupts only seq_num (bytes 5–12 of the record body) would go undetected: the record would be accepted with a wrong sequence number.
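A small sketch makes the gap concrete. It assumes a record body laid out as `[crc:4][seq_num:8][op_type:1][key_len:4][val_len:4][key][value]` (little-endian), which is an illustrative guess rather than the verified layout in wal.py. The `cover_header` flag is a hypothetical alternative that also feeds the sequence number and both lengths into the checksum.

```python
import struct
import zlib

# Assumed body header: crc32, seq_num, op_type, key_len, val_len
HEADER = struct.Struct("<IQBII")


def checksum(seq_num, op_type, key, value, cover_header):
    data = struct.pack("B", op_type) + key + value
    if cover_header:
        # Hypothetical wider coverage: seq_num and both lengths are included.
        data = struct.pack("<QII", seq_num, len(key), len(value)) + data
    return zlib.crc32(data) & 0xFFFFFFFF


def encode(seq_num, op_type, key, value, cover_header=False):
    crc = checksum(seq_num, op_type, key, value, cover_header)
    return HEADER.pack(crc, seq_num, op_type, len(key), len(value)) + key + value


def verify(body, cover_header=False):
    crc, seq_num, op_type, key_len, val_len = HEADER.unpack_from(body)
    key = body[HEADER.size:HEADER.size + key_len]
    value = body[HEADER.size + key_len:HEADER.size + key_len + val_len]
    return checksum(seq_num, op_type, key, value, cover_header) == crc


def corrupt_seq_num(body):
    """Flip one bit in the first seq_num byte (offset 4, right after the CRC)."""
    damaged = bytearray(body)
    damaged[4] ^= 0x01
    return bytes(damaged)


narrow = corrupt_seq_num(encode(42, 1, b"key", b"value"))
wide = corrupt_seq_num(encode(42, 1, b"key", b"value", cover_header=True))

print(verify(narrow))                    # True: the damaged seq_num is accepted
print(verify(wide, cover_header=True))   # False: the wider checksum rejects it
```

Widening the checksum changes the on-disk format, so existing logs would need a version marker or a migration before such a change could be adopted.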
The design establishes a clear contract: readrecord returns None for truncation (benign, expected at tail after crash) and raises ValueError for corruption (torn write that produced a plausible-length but wrong-content record). Recovery code upstream can use this distinction — skip tail truncation, but flag or halt on CRC errors.
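A recovery loop built on this contract might look like the sketch below. `replay`, `read_frame`, and `make_frame` are hypothetical names, and the stand-in reader uses a simplified `[length:4][crc:4][payload]` frame rather than the real record layout. Only the None-versus-ValueError behaviour mirrors what is described above.

```python
import io
import struct
import zlib


def make_frame(payload):
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return struct.pack("<II", len(payload) + 4, crc) + payload


def read_frame(stream):
    """Stand-in reader: None on truncation, ValueError on corruption."""
    prefix = stream.read(4)
    if len(prefix) < 4:
        return None
    (length,) = struct.unpack("<I", prefix)
    body = stream.read(length)
    if len(body) < length:
        return None
    if length < 4:
        raise ValueError("frame too short to hold a CRC")
    (crc,) = struct.unpack_from("<I", body)
    payload = body[4:]
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("CRC mismatch")
    return payload


def replay(stream):
    """Return (records, valid_end), where valid_end is the offset after the last good record."""
    records = []
    valid_end = stream.tell()
    while True:
        record = read_frame(stream)   # ValueError propagates: corruption halts recovery
        if record is None:
            break                     # truncated tail: stop quietly
        records.append(record)
        valid_end = stream.tell()
    return records, valid_end


# Two complete frames followed by a frame cut off after 5 bytes.
log = make_frame(b"put a=1") + make_frame(b"put b=2") + make_frame(b"put c=3")[:5]
records, valid_end = replay(io.BytesIO(log))
print(records, valid_end)             # [b'put a=1', b'put b=2'] 30
```

Returning `valid_end` lets a caller truncate the file to the last good record before appending again. That is one possible policy, not something the source prescribes.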
The grep for torn|corrupt|truncat|partial in the WAL tests returned zero matches. There are no tests exercising the torn-write detection path, so the CRC-mismatch behavior is untested at the WAL level (though b-tree-storage-engine/test_btree.py:257 does test CRC corruption for the B-tree).
wal-crc-does-not-cover-seqnum: CRC32 in readrecord is computed over optype + key + value only; a torn write corrupting only seqnum passes validation silently
wal-truncation-vs-corruption-distinction: readrecord returns None for short reads (truncation) but raises ValueError for CRC mismatch (corruption), establishing two distinct failure modes for callers
wal-record-length-prefix-guards-short-writes: The 4-byte length prefix at the start of each record allows readrecord to detect incomplete writes without reaching the CRC check
wal-no-torn-write-tests: The WAL module has no tests for truncated records, CRC mismatches, or partial writes, unlike the B-tree module which tests CRC corruption explicitly