truncate nor _flush calls fsync(), meaning data could be lost in an OS crash even after the Python-level write completes
Date: 2026-05-29
Time: 11:19
flush() vs fsync() Durability Gap

When Python calls file.flush(), it pushes data from Python's internal buffer into the operating system's page cache, but the data hasn't necessarily reached the physical disk yet. Only os.fsync(fd) forces the OS to write dirty pages to durable storage. If the machine loses power between flush() and the OS eventually writing the page to disk, that data is gone.
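The two steps can be made concrete with a minimal sketch. The helper name `durable_write` is hypothetical and does not appear in any of the modules discussed here; it only illustrates the order of calls needed before a write can be treated as durable.

```python
import os


def durable_write(path: str, payload: bytes) -> None:
    """Append payload and block until the OS reports it is on stable storage."""
    with open(path, "ab") as f:
        f.write(payload)      # data sits in Python's userspace buffer
        f.flush()             # userspace buffer -> OS page cache
        os.fsync(f.fileno())  # OS page cache -> storage device
```

Dropping the os.fsync() line gives the behaviour described above: the call returns once the data is in the page cache, and a power loss shortly afterwards can still discard it.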
log-structured-merge-tree/lsm.py

This is the clearest example. The WAL's append() (line 21) writes records and calls only self._fd.flush() at line 27, with no os.fsync():

    def append(self, key: str, value: bytes):
        k = key.encode("utf-8")
        self._fd.write(struct.pack(">I", len(k)))
        self._fd.write(k)
        self._fd.write(struct.pack(">I", len(value)))
        self._fd.write(value)
        self._fd.flush()  # ← Python → OS buffer only

The truncate() method (line 56) is even worse — it reopens the file to clear it, with no sync at all:

    def truncate(self):
        self._fd.close()
        self._fd = open(self._path, "wb")  # truncates to zero
        self._fd.close()                   # no fsync before close
        self._fd = open(self._path, "ab")

This creates a concrete crash scenario: the LSM tree flushes its memtable to an SSTable, then calls self._wal.truncate() (line 314) to clear the WAL. If the OS has written the truncation to disk but *not* the SSTable data, the entries in that memtable are permanently lost — the WAL is empty and the SSTable is incomplete.
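One way to close this window is to make the SSTable durable before the WAL is cleared, and to sync the truncation itself. The sketch below is not the code in lsm.py: `DurableWal`, `flush_memtable`, and `fsync_dir` are hypothetical names, the SSTable layout simply reuses the WAL's length-prefixed record format, and the directory fsync assumes a POSIX system.

```python
import os
import struct


def fsync_dir(path: str) -> None:
    """fsync a directory so a newly created file's entry survives a crash (POSIX)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def encode_record(key: str, value: bytes) -> bytes:
    """Length-prefixed key and value, matching the WAL format shown above."""
    k = key.encode("utf-8")
    return struct.pack(">I", len(k)) + k + struct.pack(">I", len(value)) + value


class DurableWal:
    """Hypothetical stand-in for the LSM tree's WAL, with fsync added."""

    def __init__(self, path: str):
        self._path = path
        self._fd = open(path, "ab")

    def append(self, key: str, value: bytes) -> None:
        self._fd.write(encode_record(key, value))
        self._fd.flush()              # Python buffer -> OS page cache
        os.fsync(self._fd.fileno())   # OS page cache -> disk

    def truncate(self) -> None:
        self._fd.close()
        with open(self._path, "wb") as f:  # truncates to zero
            os.fsync(f.fileno())           # make the truncation itself durable
        self._fd = open(self._path, "ab")

    def close(self) -> None:
        self._fd.close()


def flush_memtable(memtable: dict, sstable_path: str, wal: DurableWal) -> None:
    """Write and sync the SSTable first; only then clear the WAL."""
    with open(sstable_path, "wb") as f:
        for key in sorted(memtable):
            f.write(encode_record(key, memtable[key]))
        f.flush()
        os.fsync(f.fileno())
    fsync_dir(os.path.dirname(os.path.abspath(sstable_path)))
    wal.truncate()
```

The ordering is the point: at every instant, either the WAL or a fully synced SSTable holds each acknowledged entry, so a crash between the two steps leaves the data recoverable.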
b-tree-storage-engine/btree.py

The PageManager has a similar pattern. writemeta() (line 48), writeemptyleaf() (line 55), and writepage() (line 81) all end with self._f.flush() but no fsync(). The class *does* have an explicit sync() method (line 104) and close() (line 111) that call os.fsync(), but individual page writes during normal operation don't.
This is actually intentional in the B-tree: the WAL (btree.py:137) calls os.fsync() after every logged write, and commit() calls page_manager.sync() before clearing the WAL. The PageManager defers durability to batch syncs, while the WAL guarantees recoverability. The gap is deliberate because the WAL is the durability mechanism, not the data file.
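The division of labour can be sketched in a few lines. This is a simplified model, not the code in btree.py: the class `MiniPageStore`, its method names, the 64-byte page size, and the log record layout are all assumptions chosen for illustration.

```python
import os


class MiniPageStore:
    """Hypothetical model of 'WAL synced per write, data file synced at commit'."""

    PAGE_SIZE = 64  # illustrative only

    def __init__(self, data_path: str, wal_path: str):
        mode = "r+b" if os.path.exists(data_path) else "w+b"
        self._data = open(data_path, mode)
        self._wal = open(wal_path, "ab")

    def write_page(self, page_no: int, payload: bytes) -> None:
        if len(payload) > self.PAGE_SIZE:
            raise ValueError("payload larger than one page")
        page = payload.ljust(self.PAGE_SIZE, b"\x00")
        # 1. Log the change and make the log record durable.
        self._wal.write(page_no.to_bytes(4, "big") + page)
        self._wal.flush()
        os.fsync(self._wal.fileno())
        # 2. Update the data file; flush only, durability is deferred.
        self._data.seek(page_no * self.PAGE_SIZE)
        self._data.write(page)
        self._data.flush()

    def commit(self) -> None:
        # 3. Sync the data file *before* discarding the log that protects it.
        os.fsync(self._data.fileno())
        self._wal.truncate(0)
        os.fsync(self._wal.fileno())

    def close(self) -> None:
        self._data.close()
        self._wal.close()
```

If a crash happens between steps 2 and 3, the page writes may be missing from the data file, but every one of them is still present in the synced log and can be replayed.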
event-sourcing-store/event_store.py

persistevent() (line 131) is the most extreme case: it opens the file in a with block, writes JSON, and relies on the implicit close() to push data out. No flush(), no fsync(). Python's close() flushes the Python buffer, but the OS can still lose the data before it hits disk.
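A durable variant needs only two extra calls inside the with block. The function below is a hypothetical sketch, not the event store's actual method: the name `persist_event_durably` and the one-JSON-object-per-line file layout are assumptions.

```python
import json
import os


def persist_event_durably(path: str, event: dict) -> None:
    """Append one event as a JSON line and sync it before the file is closed."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
        f.flush()             # Python buffer -> OS page cache
        os.fsync(f.fileno())  # OS page cache -> disk, before close() runs
```

The fsync has to happen while the file is still open, because close() releases the descriptor without asking the OS to write anything to disk.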
The standalone write-ahead-log module (write-ahead-log/wal.py) is the gold standard in this codebase. Its dosync() method (line 122) calls both self.fd.flush() and os.fsync(self.fd.fileno()). Every critical operation — append (via dosync), appendbatch, checkpoint, truncate (line 184), and rotate — calls fsync(). Even Bitcask (hash-index-storage/bitcask.py:88) conditionally calls os.fsync() in writerecord() when syncwrites is enabled.
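The sync-versus-batch trade-off can be illustrated with a small sketch. It is not the code in wal.py: the class `SyncPolicyLog`, its parameters, and the `fsync_count` counter are hypothetical and exist only to show how a sync policy might gate the flush-plus-fsync pair.

```python
import os


class SyncPolicyLog:
    """Hypothetical append-only log with 'sync' and 'batch' durability modes."""

    def __init__(self, path: str, mode: str = "sync", batch_size: int = 8):
        if mode not in ("sync", "batch"):
            raise ValueError("mode must be 'sync' or 'batch'")
        self._fd = open(path, "ab")
        self._mode = mode
        self._batch_size = batch_size
        self._pending = 0       # appends written since the last fsync
        self.fsync_count = 0    # exposed only to make the policy observable

    def _do_sync(self) -> None:
        self._fd.flush()              # Python buffer -> OS page cache
        os.fsync(self._fd.fileno())   # OS page cache -> disk
        self._pending = 0
        self.fsync_count += 1

    def append(self, record: bytes) -> None:
        self._fd.write(len(record).to_bytes(4, "big") + record)
        self._pending += 1
        # 'sync' pays one fsync per append; 'batch' amortises it over N appends.
        if self._mode == "sync" or self._pending >= self._batch_size:
            self._do_sync()

    def close(self) -> None:
        if self._pending:
            self._do_sync()   # do not leave a partial batch unsynced
        self._fd.close()
```

In batch mode, up to `batch_size - 1` acknowledged appends can be lost in a crash, which is the price of the higher throughput.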
A write-ahead log that doesn't actually survive crashes isn't a write-ahead log — it's a write-behind-hope-for-the-best log. The entire point of WAL is that the log is durable *before* the main data structure is updated, so you can replay it after a crash. The LSM tree's WAL violates this contract: a power failure at the wrong moment loses acknowledged writes silently.
- log-structured-merge-tree/lsm.py:_flush: The memtable-to-SSTable flush path; check whether SSTable writes also lack fsync, compounding the durability gap
- write-ahead-log/wal.py:dosync: The correct fsync pattern with configurable sync modes (sync vs batch) that the LSM WAL should emulate
- b-tree-storage-engine/btree.py:commit: How the B-tree WAL coordinates fsync ordering between the WAL and the data file to ensure crash safety
- fdatasync-vs-fsync: fdatasync() skips metadata that is not needed to read the data back (such as timestamps) and is often faster on Linux; relevant for high-write workloads where the LSM WAL sync gap is most painful
- event-sourcing-store/event_store.py: The event store's persistence has no durability guarantees at all; contrast with the WAL module's approach
- lsm-wal-no-fsync: The LSM tree's WAL (log-structured-merge-tree/lsm.py) calls flush() but never os.fsync(), so acknowledged writes can be lost on OS crash
- lsm-truncate-before-sstable-sync: The LSM tree truncates the WAL after flushing memtable to SSTable, but neither operation calls fsync, creating a window where both the WAL and SSTable data exist only in OS page cache
- wal-module-syncs-every-write: The standalone WAL module (write-ahead-log/wal.py) calls os.fsync() after every append in sync mode and every N writes in batch mode, making it the only WAL implementation with real crash durability
- btree-wal-fsync-pagefile-deferred: The B-tree's WAL calls fsync() on every log entry, but the PageManager defers fsync to explicit sync() or close() calls, a deliberate design where the WAL provides crash recovery for un-synced page writes
- event-store-persist-no-durability: EventStore.persistevent() writes to disk with no flush() or fsync(), relying entirely on OS-level buffering and implicit close behavior