fsync ordering between the WAL file and data file; explore what guarantees the OS and filesystem actually provideDate: 2026-05-29
Time: 10:29
A write-ahead log's correctness rests on one invariant: the WAL record must be durable on disk before the corresponding data file mutation becomes durable. If the data file change lands first and the system crashes, the WAL can't replay the operation — the data file is in a new state but the WAL doesn't know what happened to get there. If the WAL record lands first and the system crashes, recovery replays the WAL and re-applies the mutation. The order of fsync calls is what enforces this.
write-ahead-log/wal.py)The dosync method at line 127-134 is the sync decision point:
def _do_sync(self, force: bool = False):
if self._sync_mode == "sync" or force:
self._fd.flush()
os.fsync(self._fd.fileno())
elif self._sync_mode == "batch":
self._write_count += 1
if self._write_count >= self._batch_sync_count:
self._fd.flush()
os.fsync(self._fd.fileno())
self._write_count = 0
In sync mode, every append() call (line ~145) does flush() + os.fsync(). This is correct for the WAL file itself. But this WAL is a standalone module — it doesn't coordinate with any data file. The caller is responsible for the ordering: write WAL, fsync WAL, mutate data, fsync data. Nothing in this code enforces or documents that protocol.
log-structured-merge-tree/lsm.py)This is the critical problem. The WAL at line 26 does:
self._fd.flush()
No os.fsync() call. The flush() only pushes data from Python's userspace buffer into the kernel's page cache. The kernel is free to reorder or delay the actual disk write indefinitely. On a power failure, every WAL record written since the last kernel writeback is lost. This WAL provides no crash durability at all — it protects only against Python process crashes (where the kernel survives and eventually writes back its pages).
hash-index-storage/bitcask.py)Bitcask at line 95-96 does:
self.active_file.flush()
if self.sync_writes:
os.fsync(self.active_file.fileno())
Bitcask sidesteps the WAL-vs-data ordering problem entirely: the append-only data file *is* the log. There's only one file to fsync. This is architecturally clean but means Bitcask doesn't need to solve the harder problem.
fsync guarantees (POSIX)os.fsync(fd) asks the kernel to flush all modified data and metadata for that file descriptor to stable storage. POSIX says that when fsync returns, the data is durable. But there are important caveats:
1. fsync is per-file, not cross-file. Calling fsync on the WAL file guarantees the WAL is durable. It says nothing about the data file. To get ordering, you must: fsync(walfd) → write data → fsync(datafd). The two fsyncs create the ordering.
2. fsync doesn't sync the directory entry. When you create a new file (as rotate does at wal.py line 114-120), the file's contents may be durable but the *directory entry pointing to it* may not be. On ext4, if you fsync the new WAL file but not the directory, a crash can result in the file's contents existing on disk with no directory entry — the file is unreachable. The grep for dir.*fsync|fsync.*dir returned zero matches across the entire codebase. Every rotate() call, every truncation rewrite, and every SSTable creation is vulnerable to this.
3. fdatasync vs fsync. fdatasync syncs data but skips metadata (timestamps, permissions) unless the file size changed. For append workloads where the file grows, fdatasync still has to sync the size change — so the saving is minimal. For in-place overwrites it's faster. This codebase never uses fdatasync (zero matches in grep). For a WAL that's always appending, fsync is the correct choice anyway.
data=ordered (the default): Data blocks are written before the metadata transaction commits. This provides some implicit ordering — you won't see a file with stale data blocks — but it does *not* replace fsync. A crash can still lose recently-written data that was in the page cache.data=writeback: No ordering between data and metadata at all. A crash can expose stale or garbage data blocks. This makes the missing fsyncs in the LSM WAL even more dangerous.fsync on macOS traditionally did *not* flush the disk's write cache — only fcntl(fd, FFULLFSYNC) did. As of recent macOS versions, fsync behavior has improved, but FFULLFSYNC remains the only guaranteed-durable call. This codebase runs on macOS (Darwin 24.4.0) and uses os.fsync() everywhere, never F_FULLFSYNC. On macOS, none of the fsync calls may actually be durable against power loss.| Gap | Location | Impact |
|-----|----------|--------|
| LSM WAL never fsyncs | lsm.py:26 | WAL provides zero crash durability |
| No directory fsync after rotation | wal.py:114-120 | New WAL file can vanish on crash |
| No directory fsync after truncation | wal.py:183-209 (truncate rewrites files) | Rewritten files can vanish |
| No F_FULLFSYNC on macOS | All os.fsync() calls | Not truly durable on macOS |
| No cross-file ordering protocol | Entire codebase | Callers must implement WAL→data ordering themselves, undocumented |
batch and none Sync ModesThe batch mode at wal.py line 131-134 accumulates writes and only fsyncs every batchsynccount operations. Records between syncs are vulnerable. This is a legitimate performance tradeoff (group commit), but the window of vulnerability should be documented. The none mode never fsyncs at all — the WAL is purely a logical log for process crash recovery, not power failure recovery.
The standalone WAL in wal.py does the right thing for its own file in sync mode — but it doesn't solve the harder problem of cross-file ordering, and it's missing directory fsyncs for file creation. The LSM WAL in lsm.py doesn't fsync at all, making it ineffective against anything beyond a process crash. And on macOS, os.fsync() may not actually reach stable storage without F_FULLFSYNC.