Topic: os.fdatasync() skips inode metadata (mtime/size) and is faster; explore where the codebase could use it instead of os.fsync() without sacrificing correctness

Date: 2026-05-29

Time: 06:38

os.fdatasync() vs os.fsync() in the DDIA Implementations

The Core Difference

os.fsync(fd) flushes both file data and all inode metadata (mtime, ctime, size, permissions) to stable storage. os.fdatasync(fd) flushes file data plus only the metadata needed to retrieve that data — specifically, it updates size if the file grew, but skips mtime/ctime/permissions. On Linux, when you overwrite existing bytes without extending the file, fdatasync avoids the metadata write entirely — that's where the real speedup lives.

The 13 fsync Call Sites

The codebase has 13 os.fsync() calls and zero os.fdatasync() calls. Here's the breakdown by safety of replacement:

Safe to replace — and a meaningful win

B-tree in-place page overwrites (b-tree-storage-engine/btree.py):

Safe to replace — modest win

These are all append-only writes where the file size changes. fdatasync still flushes the size update (it must, to read the data back correctly), but it skips mtime/ctime — a small saving per call:

| File | Line | Context |

|------|------|---------|

| write-ahead-log/wal.py | 115 | _rotate() — flush before closing old WAL segment |

| write-ahead-log/wal.py | 128 | dosync() — per-record sync in "sync" mode |

| write-ahead-log/wal.py | 133 | dosync() — batch sync when count threshold met |

| write-ahead-log/wal.py | 184 | truncate() — flush before closing for rewrite |

| write-ahead-log/wal.py | 209 | truncate() — sync the rewritten file |

| write-ahead-log/wal.py | 261 | close() — final flush |

| b-tree-storage-engine/btree.py | 137 | WAL.log_write() — append WAL entry |

| hash-index-storage/bitcask.py | 88 | writerecord() — append to active data file |

Safe but needs care

Where the Win Is Largest

The B-tree PageManager is the standout candidate. Here's why:

1. B-trees do in-place overwrites of fixed-size pages — the file size doesn't change for most writes

2. PageManager.sync() at line 105 is called on every committed write path

3. Skipping the metadata write means one fewer disk seek per sync on rotational media, and one fewer journal entry on ext4

For the WAL and Bitcask modules, the savings are real but smaller — they append, so the size metadata must be flushed regardless.

A Notable Gap: LSM Tree WAL

The LSM tree's WAL (log-structured-merge-tree/lsm.py, line 26) calls self._fd.flush() but never calls fsync or fdatasync. This means its WAL entries can survive in the OS page cache without reaching stable storage — a durability gap. This isn't an fdatasync candidate; it's a missing sync entirely.

Platform Caveat

os.fdatasync() is a POSIX call available on Linux and macOS. On macOS, however, both fsync and fdatasync have the same weak semantics (neither guarantees write-through to the physical drive platter; you need fcntl(fd, F_FULLFSYNC) for that). So the fdatasync optimization only provides a real benefit on Linux with ext4/xfs/btrfs. A production implementation would want:


import sys

def _sync_data(fd):
    if sys.platform == "linux":
        os.fdatasync(fd.fileno())
    else:
        os.fsync(fd.fileno())

Topics to Explore

Beliefs