fsync after compaction writes vs only on normal writes, since this determines whether crash recovery is actually safe or just lucky
Date: 2026-05-29
Time: 08:41
This audit determines which implementations ensure compaction output is durable on disk (via fsync) versus merely flushed to OS buffers, and whether crash recovery guarantees are real or accidental.
| Implementation | Normal writes fsync? | Compaction fsync? | Crash-safe? |
|---|---|---|---|
| write-ahead-log/wal.py | Yes | N/A (no compaction) | Yes |
| b-tree-storage-engine/btree.py | Via WAL | N/A (page-based, no compaction) | Yes |
| hash-index-storage/bitcask.py | Conditional (sync_writes) | Unknown (code cut off) | Maybe |
| log-structured-merge-tree/lsm.py | No — flush only | No — flush only | No |
| log-structured-hash-table/bitcask.py | No — flush only | Unknown (code cut off) | No |
| sstable-and-compaction/sstable.py | No | Unknown (code cut off) | No |
write-ahead-log/wal.py is the gold standard in the codebase. Every durability-critical path pairs flush() with os.fsync():
- dosync (line 114-115): fsyncs in "sync" mode, batches in "batch" mode
- append_batch (line 183-184): always force-fsyncs after the COMMIT record
- _rotate (line 127-128): fsyncs the old file before opening a new one
- truncate (line 208-209): fsyncs the rewritten file after removing old records

The WAL implementation understands that flush() alone only moves data from Python's userspace buffer to the OS page cache; it does not guarantee the data reaches stable storage. The consistent flush() → os.fsync() pattern ensures actual durability.
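
The flush() then os.fsync() pairing described above can be sketched as a small helper. This is a minimal illustration, not the code in wal.py: the name `durable_append` and the file name `example.wal` are hypothetical.

```python
import os
import tempfile


def durable_append(f, payload: bytes) -> None:
    """Append payload and block until the OS reports it has been synced."""
    f.write(payload)
    f.flush()             # Python userspace buffer -> OS page cache
    os.fsync(f.fileno())  # OS page cache -> storage device


# Usage: append one record to a scratch file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.wal")  # hypothetical file name
    with open(path, "ab") as f:
        durable_append(f, b"record-1\n")
    with open(path, "rb") as f:
        contents = f.read()
```

Dropping the os.fsync() line would leave the read-back unchanged, which is why ordinary tests cannot tell the two variants apart; the difference only shows up after power loss.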
The B-tree uses a two-phase approach:
- PageManager.writepage() (line ~85) and writemeta() (line ~47): only call self.f.flush(), no fsync. This is intentional.
- PageManager.sync() (line ~111): calls flush() + os.fsync().
- PageManager.close() (line ~115): calls flush() + os.fsync().

The safety comes from the WAL:
1. WAL.log_write() (line ~140) writes page data to the WAL and fsyncs it
2. WAL.commit() (line ~147) calls page_manager.sync() (which fsyncs the data file), then truncates the WAL and fsyncs that too
3. WAL.recover() (line ~155) replays logged writes and fsyncs
This is correct: individual page writes don't need to be durable because the WAL can replay them after a crash. The commit() call is the durability barrier.
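
The log, commit, recover sequence above can be sketched in miniature. The classes `TinyPages` and `TinyWAL`, the record layout, and the 16-byte page size are all assumptions made for illustration; they are not the structures in btree.py.

```python
import os
import struct
import tempfile

PAGE_SIZE = 16  # assumed tiny page size, only to keep the example readable


class TinyPages:
    """Hypothetical page file: page writes are flushed but not fsynced."""

    def __init__(self, path):
        if not os.path.exists(path):
            open(path, "wb").close()  # "r+b" requires an existing file
        self.f = open(path, "r+b")

    def write_page(self, page_no, data):
        self.f.seek(page_no * PAGE_SIZE)
        self.f.write(data.ljust(PAGE_SIZE, b"\0"))
        self.f.flush()  # reaches the page cache only

    def read_page(self, page_no):
        self.f.seek(page_no * PAGE_SIZE)
        return self.f.read(PAGE_SIZE)

    def sync(self):
        self.f.flush()
        os.fsync(self.f.fileno())


class TinyWAL:
    """Hypothetical WAL: every logged write is fsynced before returning."""

    def __init__(self, path, pages):
        self.path = path
        self.pages = pages
        self.f = open(path, "ab")

    def log_write(self, page_no, data):
        self.f.write(struct.pack("<I", page_no) + data.ljust(PAGE_SIZE, b"\0"))
        self.f.flush()
        os.fsync(self.f.fileno())

    def commit(self):
        self.pages.sync()   # durability barrier for the data file
        self.f.truncate(0)  # the log is no longer needed
        self.f.flush()
        os.fsync(self.f.fileno())

    def recover(self):
        rec_size = 4 + PAGE_SIZE
        with open(self.path, "rb") as f:
            while True:
                rec = f.read(rec_size)
                if len(rec) < rec_size:
                    break  # end of log, or a torn final record
                (page_no,) = struct.unpack("<I", rec[:4])
                self.pages.write_page(page_no, rec[4:])
        self.pages.sync()


with tempfile.TemporaryDirectory() as tmp:
    pages = TinyPages(os.path.join(tmp, "data.db"))
    wal = TinyWAL(os.path.join(tmp, "data.wal"), pages)
    wal.log_write(0, b"hello")  # logged durably; the page file is untouched
    wal.recover()               # stand-in for a restart: replay rebuilds page 0
    recovered = pages.read_page(0)
    wal.commit()
    wal_size_after_commit = os.path.getsize(wal.path)
    wal.f.close()
    pages.f.close()
```

The usage block skips the page write on purpose, to mimic a crash between logging and writing the page: replay alone is enough to rebuild page 0.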
log-structured-merge-tree/lsm.py is the most concerning finding. The LSM tree's internal WAL (class WAL, line 13) only calls self._fd.flush() at line 26 and never os.fsync(). Grep confirms zero os.fsync calls anywhere in this file.
- WAL (line 26): self._fd.flush() only
- _flush (line 303): writes to a new SSTable file; needs verification, but no fsync appears in the grep results
- compact (line 319): merges SSTables; code was cut off at line 200 in the observation, but the complete grep shows no fsync anywhere in the file

This means a power failure or kernel crash at any point can lose data. The WAL is supposed to be the recovery mechanism, but since it never fsyncs, a power failure can lose WAL entries that were "flushed" to OS buffers but never reached disk. Compaction is doubly dangerous: if the system goes down mid-compaction, the new merged SSTable may be partially written (or entirely in page cache), old SSTables may already be deleted, and the WAL has already been truncated.
Verdict: crash recovery here is purely lucky — it works only because the OS eventually flushes dirty pages, and the tests never simulate actual power loss.
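
The grep check used in this audit can be made slightly more robust by walking the syntax tree instead of matching text, so comments and strings that mention os.fsync are not counted. The helper name `count_calls` and the two sample sources below are hypothetical; the sketch only recognises the `os.fsync(...)` spelling and would miss `from os import fsync`.

```python
import ast


def count_calls(source: str, module: str, func: str) -> int:
    """Count calls written as module.func(...) in Python source text."""
    count = 0
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == func
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == module
        ):
            count += 1
    return count


# Hypothetical WAL that only flushes, in the shape this audit flags.
flush_only = """
class WAL:
    def append(self, rec):
        self._fd.write(rec)
        self._fd.flush()
"""

# The same method with an fsync appended to its body.
flush_and_fsync = flush_only + "        os.fsync(self._fd.fileno())\n"

flush_only_hits = count_calls(flush_only, "os", "fsync")
fsync_hits = count_calls(flush_and_fsync, "os", "fsync")
```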
In hash-index-storage/bitcask.py, normal writes in writerecord() (line ~97-99) flush and then conditionally fsync:

    self.active_file.flush()
    if self.sync_writes:
        os.fsync(self.active_file.fileno())

The sync_writes flag (defaulting to True in __init__, line 31) makes normal writes durable. But the compact() method starts at line ~213 and was cut off in the observations, so we cannot confirm whether compaction output is fsynced.
In log-structured-hash-table/bitcask.py, writerecord() (line ~157) calls only self.activefile.flush(): there is no fsync and no conditional sync option. The compaction code was cut off in observations, but there are zero os.fsync calls in the grep results for this file.
In sstable-and-compaction/sstable.py, SSTableWriter.finish() (line ~89) closes the file after writing the index and footer, but never calls flush() or os.fsync(). Closing the file flushes Python's buffers but does not guarantee the data reaches disk. Compaction strategy code was cut off at line 200.
For any implementation that compacts without fsync, the dangerous pattern is:
1. Write new compacted file (data in OS page cache only)
2. Delete old segment files
3. Crash before OS flushes the new file to disk
4. On recovery: old files gone, new file empty or corrupt
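
For contrast with the sequence above, one safer ordering is sketched here: fsync the new file, rename it into place, fsync the directory, and only then delete the inputs. This is an assumption about how a fix could look, not code from any audited file. The names `fsync_dir` and `replace_segments` and the segment file names are hypothetical, and fsyncing a directory this way assumes a POSIX system.

```python
import os
import tempfile


def fsync_dir(path: str) -> None:
    """fsync a directory so renames and unlinks inside it persist (POSIX)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def replace_segments(directory, old_names, new_name, merged: bytes) -> None:
    """Write the merged segment durably, then remove the segments it replaces."""
    tmp_path = os.path.join(directory, new_name + ".tmp")
    final_path = os.path.join(directory, new_name)
    with open(tmp_path, "wb") as f:
        f.write(merged)
        f.flush()
        os.fsync(f.fileno())          # 1. merged data is on disk first
    os.replace(tmp_path, final_path)  # 2. atomic rename to the final name
    fsync_dir(directory)              # 3. make the rename itself durable
    for name in old_names:            # 4. only now drop the old segments
        os.remove(os.path.join(directory, name))
    fsync_dir(directory)


with tempfile.TemporaryDirectory() as d:
    for name, body in (("seg-1", b"a=1\n"), ("seg-2", b"b=2\n")):
        with open(os.path.join(d, name), "wb") as f:
            f.write(body)
    replace_segments(d, ["seg-1", "seg-2"], "seg-merged", b"a=1\nb=2\n")
    remaining = sorted(os.listdir(d))
    with open(os.path.join(d, "seg-merged"), "rb") as f:
        merged_contents = f.read()
```

With this ordering, a crash before step 2 leaves the old segments intact plus a stray .tmp file, and a crash after step 3 leaves a complete merged segment, so recovery never sees both gone.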
Among the implementations that compact, only hash-index-storage/bitcask.py (with sync_writes=True) makes normal writes durable, and its compaction path is still unverified. write-ahead-log/wal.py and b-tree-storage-engine/btree.py are properly durable. The LSM tree, log-structured hash table, and SSTable implementations are all vulnerable.
The compaction methods for three implementations (hash-index-storage/bitcask.py:compact, log-structured-hash-table/bitcask.py:compact, sstable-and-compaction/sstable.py compaction strategies) were cut off at line 200. A complete audit requires reading these methods to confirm whether they fsync compaction output before deleting old segments.