os.fsync which also syncs file metadata; fdatasync would be faster on Linux by skipping metadata. Worth understanding the tradeoff.
Date: 2026-05-29
Time: 12:11
os.fsync vs fdatasync: The Metadata Sync Tradeoff

This codebase uses os.fsync exclusively: 13 call sites across three modules, zero uses of fdatasync. That's a deliberate choice favoring correctness over performance, but it's worth understanding what you're paying for.
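What fsync actually does

os.fsync(fd) forces the file's data and its metadata (size, modification time, permissions) from the OS page cache to durable storage. It does not necessarily persist the directory entry: on Linux, making a newly created or renamed file durable also requires an fsync on the containing directory. The kernel may keep written data in the page cache for an unbounded time; without an explicit sync, a power failure can lose data that write() already reported as successful.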
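A minimal sketch of the write path this paragraph describes, assuming a plain append-only file. The helper names durable_append and fsync_parent_dir are hypothetical and do not come from this codebase; the directory sync is included on the assumption of a POSIX system, where a new file's directory entry needs its own fsync.

```python
import os


def durable_append(path: str, payload: bytes) -> None:
    """Append payload and return only after the OS reports it synced."""
    with open(path, "ab") as f:
        f.write(payload)      # data sits in Python's userspace buffer
        f.flush()             # userspace buffer -> OS page cache
        os.fsync(f.fileno())  # page cache -> storage (data and file metadata)


def fsync_parent_dir(path: str) -> None:
    """Sync the containing directory so a newly created file's entry survives.

    Assumption: POSIX only. Directories cannot be opened this way on Windows,
    so the function is a no-op there.
    """
    if os.name != "posix":
        return
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Calling durable_append followed by fsync_parent_dir the first time a file is created covers both the contents and the name under which they can be found again; subsequent appends to the same file only need the first call.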
The calls fall into three categories:
WAL durability (write-ahead-log/wal.py): The dosync method at lines 127–133 calls os.fsync after every write in "sync" mode, or every N writes in "batch" mode. The _rotate method (line 115) syncs before closing the old file. truncate (line 184) syncs before rewriting. These are the most performance-critical call sites — WAL append is on the hot path for every write operation.
B-tree page manager (b-tree-storage-engine/btree.py): PageManager.sync() at line 105 and PageManager.close() at line 113 both flush+fsync. The WAL's log_write (line 137), commit (line 144), and recover (line 171) also fsync. Every page write goes through the WAL first, so you're paying for fsync twice per mutation: once for the WAL entry, once when committing pages to the data file.
Bitcask (hash-index-storage/bitcask.py): writerecord at line 88 syncs after every record when sync_writes=True (the default). Every put and delete hits this path.
What fdatasync would save

fdatasync (available as os.fdatasync on Linux, but not on macOS or Windows) skips syncing file metadata unless that metadata is needed to read the data back. Specifically, it still syncs the file size if it changed (because you need the size to find the end of the file), but skips things like mtime and atime.
The practical saving: one fewer disk I/O operation per sync when the file size didn't change (overwriting existing pages in the B-tree) or when the metadata update can be deferred. On spinning disks, that extra write can cost up to a full disk revolution (about 8 ms at 7200 RPM). On SSDs the difference is smaller but can still be measurable under high throughput.
Why fsync is the right default here

For a reference implementation of DDIA concepts, fsync is the correct choice:
1. Portability: os.fdatasync doesn't exist on macOS or Windows. These implementations run on all platforms without conditional logic.
2. Append-heavy workloads: The WAL (wal.py) and Bitcask (bitcask.py) are append-only. Every write extends the file, changing its size. fdatasync must sync the size change in this case, so the saving is minimal — you'd only skip the mtime update.
3. B-tree overwrites are the exception: PageManager.write_page (btree.py:80) overwrites existing pages in-place. This is the one case where fdatasync would meaningfully help — the file size doesn't change, so metadata sync is pure overhead. But the B-tree WAL writes (which are appends) still wouldn't benefit.
4. Correctness over performance: The LSM tree's WAL (log-structured-merge-tree/lsm.py) is the outlier: it calls flush() at line 26 without any fsync, and flush() only moves data from Python's buffer into the OS page cache, so it is the least durable module. The other modules err on the side of too much syncing rather than too little, which is appropriate for teaching crash recovery semantics.
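The distinction between points 2 and 3 comes down to whether a write changes the file size. The sketch below makes that visible for fixed-size pages; PAGE_SIZE, overwrite_page, and the 4096-byte value are assumptions for illustration and are not taken from btree.py.

```python
import os

PAGE_SIZE = 4096  # assumed page size for this sketch


def overwrite_page(f, page_no: int, data: bytes) -> bool:
    """Write one page and sync it; return True if the file size was unchanged.

    When the size is unchanged, a data-only sync has no size metadata to
    write, which is the case where fdatasync can beat fsync.
    """
    if len(data) != PAGE_SIZE:
        raise ValueError("data must be exactly one page")
    size_before = os.fstat(f.fileno()).st_size
    f.seek(page_no * PAGE_SIZE)
    f.write(data)
    f.flush()
    # Use fdatasync where the platform provides it, otherwise fsync.
    getattr(os, "fdatasync", os.fsync)(f.fileno())
    return os.fstat(f.fileno()).st_size == size_before
```

Overwriting an existing page returns True, while writing a page at the end of the file returns False because the file grew, mirroring the append-only WAL and Bitcask case.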
The bigger performance lever isn't fsync vs fdatasync; it's sync frequency. The WAL already demonstrates this with its syncmode parameter: "sync" (every write), "batch" (every N writes), or "none" (never). Batching amortizes the cost of one fsync across many writes. The Bitcask store's sync_writes boolean at construction time is the same idea. Group commit, which collects multiple writers' data and issues one fsync for the batch, is what production databases typically do to solve this.
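A rough sketch of the batching idea, not the WAL's actual implementation: BatchedLog, sync_every, and fsync_calls are hypothetical names introduced here, and the class is single-threaded, so it shows batch sync rather than true multi-writer group commit.

```python
import os


class BatchedLog:
    """Append-only log that issues one fsync per `sync_every` appends."""

    def __init__(self, path: str, sync_every: int = 8) -> None:
        self._f = open(path, "ab")
        self._sync_every = sync_every
        self._pending = 0     # appends written since the last fsync
        self.fsync_calls = 0  # counter kept only to make the saving visible

    def append(self, record: bytes) -> None:
        self._f.write(record)
        self._pending += 1
        if self._pending >= self._sync_every:
            self.sync()

    def sync(self) -> None:
        if self._pending == 0:
            return
        self._f.flush()
        os.fsync(self._f.fileno())
        self.fsync_calls += 1
        self._pending = 0

    def close(self) -> None:
        self.sync()  # make any trailing partial batch durable
        self._f.close()
```

With sync_every=8, twenty appends cost three fsync calls instead of twenty. The price is that up to sync_every - 1 records that append() has already accepted can be lost in a crash, which is the same tradeoff the "batch" mode makes.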