Topic: Neither truncate nor _flush calls fsync(), meaning data could be lost in an OS crash even after the Python-level write completes

Date: 2026-05-29

Time: 11:19

The flush() vs fsync() Durability Gap

The Core Problem

When Python calls file.flush(), it pushes data from Python's user-space buffer into the operating system's page cache, but the data has not necessarily reached the physical disk. os.fsync(fd) is the call that asks the OS to write that file's dirty pages to durable storage. If the machine loses power or the kernel crashes after flush() but before the OS writes the pages back on its own schedule, that data is gone. A crash of the Python process alone does not lose flushed data, because the page cache belongs to the kernel.
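A minimal sketch of the two steps, assuming a hypothetical helper named durable_write (it is not part of this codebase): flush() moves bytes into the page cache, and os.fsync() blocks until the OS reports that they have reached the storage device.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Hypothetical helper: write data and wait until the OS reports it is on disk."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # Python buffer -> OS page cache
        os.fsync(f.fileno())   # OS page cache -> storage device


# Usage: write a small file into a scratch directory.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
durable_write(demo_path, b"hello")
```

How strong the guarantee is still depends on the platform and on the drive honoring the sync request.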

Where the Gap Exists

LSM Tree WAL — log-structured-merge-tree/lsm.py

This is the clearest example. The WAL's append() (line 21) writes records and calls only self._fd.flush() at line 27 — no os.fsync():


def append(self, key: str, value: bytes):
    k = key.encode("utf-8")
    self._fd.write(struct.pack(">I", len(k)))
    self._fd.write(k)
    self._fd.write(struct.pack(">I", len(value)))
    self._fd.write(value)
    self._fd.flush()          # ← Python → OS buffer only
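
One way to close the gap in append(), sketched here with a hypothetical SyncedWAL class rather than the actual lsm.py code: build the record in one buffer, write it, then call flush() and os.fsync() before returning. The record layout (big-endian length prefixes) follows the excerpt above; the class name, constructor, and close() are assumptions.

```python
import os
import struct
import tempfile


class SyncedWAL:
    """Hypothetical stand-in for the LSM WAL; only append() is sketched."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._fd = open(path, "ab")

    def append(self, key: str, value: bytes) -> None:
        k = key.encode("utf-8")
        # Same layout as the excerpt: len(key), key, len(value), value.
        record = struct.pack(">I", len(k)) + k + struct.pack(">I", len(value)) + value
        self._fd.write(record)
        self._fd.flush()                # Python buffer -> OS page cache
        os.fsync(self._fd.fileno())     # OS page cache -> storage device

    def close(self) -> None:
        self._fd.close()


# Usage: append one record to a WAL in a scratch directory.
wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = SyncedWAL(wal_path)
wal.append("k1", b"v1")
wal.close()
```

Calling fsync() on every append costs latency, so batching several records per sync is a common trade-off.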

The truncate() method (line 56) is even worse — it reopens the file to clear it, with no sync at all:


def truncate(self):
    self._fd.close()
    self._fd = open(self._path, "wb")   # truncates to zero
    self._fd.close()                     # no fsync before close
    self._fd = open(self._path, "ab")

This creates a concrete crash scenario: the LSM tree flushes its memtable to an SSTable, then calls self._wal.truncate() (line 314) to clear the WAL. Because neither step calls fsync(), the OS is free to write the two changes back in either order. If the truncation reaches disk but the SSTable data does *not*, the entries in that memtable are permanently lost: the WAL is empty and the SSTable is incomplete.
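
A safer ordering can be sketched as follows, under stated assumptions: the helper names are hypothetical, the temp-file-plus-rename step is an added technique rather than what lsm.py does, and the directory fsync assumes a POSIX system. The point is only the order: the SSTable must be durable before the WAL is cleared, and the truncation itself is synced.

```python
import os
import tempfile


def fsync_dir(dirpath: str) -> None:
    """Sync a directory so new or renamed entries in it are durable (POSIX assumption)."""
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def write_sstable_durably(path: str, payload: bytes) -> None:
    """Hypothetical: write to a temp file, fsync it, rename into place, fsync the directory."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)                       # atomic rename over the final name
    fsync_dir(os.path.dirname(path) or ".")


def truncate_wal_durably(path: str) -> None:
    """Hypothetical: truncate the WAL and sync the truncation."""
    with open(path, "wb") as f:                 # opening with "wb" truncates to zero
        f.flush()
        os.fsync(f.fileno())


# Usage: the SSTable is made durable first, and only then is the WAL cleared.
workdir = tempfile.mkdtemp()
wal_file = os.path.join(workdir, "wal.log")
sst_file = os.path.join(workdir, "000001.sst")
with open(wal_file, "wb") as f:
    f.write(b"pending-entries")
write_sstable_durably(sst_file, b"sorted-entries")
truncate_wal_durably(wal_file)
```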

B-Tree PageManager — b-tree-storage-engine/btree.py

The PageManager has a similar pattern. writemeta() (line 48), writeemptyleaf() (line 55), and writepage() (line 81) all end with self._f.flush() but no fsync(). The class *does* have an explicit sync() method (line 104) and close() (line 111) that call os.fsync(), but individual page writes during normal operation don't.

This is intentional in the B-tree. The WAL (btree.py:137) calls os.fsync() after every logged write, and commit() calls page_manager.sync() before clearing the WAL. The PageManager defers durability to batch syncs, while the WAL guarantees recoverability: page writes lost in a crash can be replayed from the log. The gap is deliberate because the WAL is the durability mechanism, not the data file.
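
That division of labor can be sketched roughly as follows. PageFile, Log, and commit are hypothetical stand-ins, not the actual btree.py classes; the sketch only illustrates the ordering described above: the log is synced on every write, pages are synced once at commit, and the log is cleared only after that sync.

```python
import os
import tempfile


class PageFile:
    """Hypothetical page store: flush() per write, fsync() only in sync()."""

    def __init__(self, path: str, page_size: int = 64) -> None:
        self._f = open(path, "w+b")
        self._page_size = page_size

    def write_page(self, page_no: int, data: bytes) -> None:
        if len(data) > self._page_size:
            raise ValueError("data larger than page size")
        self._f.seek(page_no * self._page_size)
        self._f.write(data.ljust(self._page_size, b"\x00"))
        self._f.flush()                  # page cache only; cheap

    def sync(self) -> None:
        self._f.flush()
        os.fsync(self._f.fileno())       # one fsync covers all earlier page writes


class Log:
    """Hypothetical log: every record is fsynced before log() returns."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._f = open(path, "ab")

    def log(self, record: bytes) -> None:
        self._f.write(record)
        self._f.flush()
        os.fsync(self._f.fileno())

    def clear(self) -> None:
        self._f.close()
        self._f = open(self._path, "wb")  # truncates to zero
        os.fsync(self._f.fileno())


def commit(pages: PageFile, log: Log) -> None:
    pages.sync()   # pages must be durable first
    log.clear()    # only then is it safe to drop the log


# Usage: log first, write the page, then commit.
btree_dir = tempfile.mkdtemp()
data_path = os.path.join(btree_dir, "data.db")
log_path = os.path.join(btree_dir, "data.wal")
pages, log = PageFile(data_path), Log(log_path)
log.log(b"set page 0")
pages.write_page(0, b"leaf")
commit(pages, log)
```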

Event Store — event-sourcing-store/event_store.py

persistevent() (line 131) is the most extreme case — it opens the file in a with block, writes JSON, and relies on the implicit close() to push data out. No flush(), no fsync(). Python's close() flushes the Python buffer, but the OS can still lose the data before it hits disk.
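
A possible fix, sketched under assumptions: the function name persist_event_durably and the one-JSON-object-per-line append format are hypothetical, since the actual on-disk layout of event_store.py is not shown here. The only change that matters is the explicit flush() and os.fsync() inside the with block, before the implicit close().

```python
import json
import os
import tempfile


def persist_event_durably(path: str, event: dict) -> None:
    """Hypothetical: append one JSON line and sync it before the file is closed."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
        f.flush()                # Python buffer -> OS page cache
        os.fsync(f.fileno())     # OS page cache -> storage device


# Usage: persist two events to a scratch file.
events_path = os.path.join(tempfile.mkdtemp(), "events.jsonl")
persist_event_durably(events_path, {"type": "created", "id": 1})
persist_event_durably(events_path, {"type": "renamed", "id": 1})
```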

Where It's Done Correctly

The standalone write-ahead-log module (write-ahead-log/wal.py) is the gold standard in this codebase. Its dosync() method (line 122) calls both self.fd.flush() and os.fsync(self.fd.fileno()). Every critical operation — append (via dosync), appendbatch, checkpoint, truncate (line 184), and rotate — calls fsync(). Even Bitcask (hash-index-storage/bitcask.py:88) conditionally calls os.fsync() in writerecord() when syncwrites is enabled.

Why It Matters

A write-ahead log that doesn't actually survive crashes isn't a write-ahead log — it's a write-behind-hope-for-the-best log. The entire point of WAL is that the log is durable *before* the main data structure is updated, so you can replay it after a crash. The LSM tree's WAL violates this contract: a power failure at the wrong moment loses acknowledged writes silently.

Topics to Explore

Beliefs