Topic: None of the implementations use the atomic safe-save pattern (write temp, fsync, rename); explore whether compaction and SSTable creation should adopt it

Date: 2026-05-29

Time: 06:37

The Missing Atomic Safe-Save Pattern

What the Pattern Is

The atomic safe-save pattern is a well-known technique for crash-safe file creation:

1. Write to a temporary file in the same directory

2. fsync() the temporary file (flush to disk)

3. rename() the temp file to the final path (atomic on POSIX)

If a crash occurs during step 1 or 2, the final path either doesn't exist yet or still has its old content. You never see a half-written file at the target path.

What the Implementations Actually Do

Every file-writing path in this codebase writes directly to the final destination. Let's trace each one:

SSTable Creation (sstable-and-compaction/sstable.py:49)

SSTableWriter.__init__ opens the final filepath immediately:


self._f = open(filepath, "wb")
self._f.write(struct.pack(HEADER_FMT, MAGIC, VERSION, 0))

Then finish() (line ~85) writes the index, footer, seeks back to update the header entry count, and closes the file. No fsync at all — the data may still be in the OS page cache when finish() returns. A crash mid-write leaves a corrupt, partially-written SSTable at the final path.
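A minimal sketch of how a writer of this shape could defer publication until finish(), assuming a simplified layout. The class name AtomicSSTableWriter, the add() and abort() methods, the header constants, and the length-prefixed record format are all hypothetical stand-ins, and the index and footer that the real finish() writes are omitted:

```python
import os
import struct
import tempfile

# Hypothetical layout for illustration only; the real HEADER_FMT, MAGIC and
# VERSION live in sstable.py and may differ.
HEADER_FMT = "<4sHI"
MAGIC = b"SSTB"
VERSION = 1


class AtomicSSTableWriter:
    """Writes to a temp file and publishes it at filepath only in finish()."""

    def __init__(self, filepath):
        self._final_path = filepath
        self._dir = os.path.dirname(filepath) or "."
        # Same directory as the target, so the later rename stays on one filesystem.
        fd, self._tmp_path = tempfile.mkstemp(dir=self._dir, suffix=".tmp")
        self._f = os.fdopen(fd, "wb")
        self._count = 0
        # Placeholder entry count, patched in finish().
        self._f.write(struct.pack(HEADER_FMT, MAGIC, VERSION, 0))

    def add(self, key, value):
        # Length-prefixed key and value (an assumed record format).
        self._f.write(struct.pack("<II", len(key), len(value)) + key + value)
        self._count += 1

    def finish(self):
        # Patch the header, then make the bytes durable before publishing.
        self._f.seek(0)
        self._f.write(struct.pack(HEADER_FMT, MAGIC, VERSION, self._count))
        self._f.flush()
        os.fsync(self._f.fileno())
        self._f.close()
        os.replace(self._tmp_path, self._final_path)  # atomic publish
        # Persist the rename itself by syncing the directory (POSIX only).
        dir_fd = os.open(self._dir, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)

    def abort(self):
        # Discard the temp file; the final path is never touched.
        self._f.close()
        if os.path.exists(self._tmp_path):
            os.unlink(self._tmp_path)
```

With this shape, a reader that scans the directory for SSTables either finds a complete file at the final path or finds nothing, and the seek-back header patch happens before the file becomes visible.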

LSM SSTable Flush (log-structured-merge-tree/lsm.py:80)


with open(path, "wb") as f:

Same pattern — writes directly to the final SSTable path. The _flush() method (line 303) writes the memtable contents here, then adds the SSTable to the live list. A crash during the write leaves an incomplete SSTable that the recovery code will try to open.

Bitcask Compaction (hash-index-storage/bitcask.py)

The compact() method (visible from the structure) creates a new data file and writes merged entries. It also writes hint files directly (_write_hint_file, line 159):


with open(self._hint_path(file_id), "wb") as f:

No fsync, no temp file. A crash during compaction can leave both the old and new data files in inconsistent states.

Log-Structured Hash Table Compaction (log-structured-hash-table/bitcask.py:264)


with open(compacted_path, "wb") as out:

Direct write to the compacted segment path. Then hint files at line 370:


with open(hint_path, "wb") as hf:

Again, no temp-file indirection.

WAL Truncation (write-ahead-log/wal.py:203)


with open(path, "wb") as f:

This one is particularly interesting — the WAL truncate() method rewrites WAL files in place to remove old records. If the process crashes mid-rewrite, you lose both the old records AND the records that should have been kept. The WAL — the component whose entire purpose is crash recovery — is itself not crash-safe during truncation.

The Contrast: What They Do Get Right

The WAL implementation (write-ahead-log/wal.py) does use fsync correctly for its append path: _do_sync() (around line 125) calls os.fsync(self._fd.fileno()) after writes. The Bitcask implementation (hash-index-storage/bitcask.py:92) has a sync_writes option that fsyncs after each record write. So the concept of durability is present; it's just not applied to the file-creation operations that are most vulnerable.

Should Compaction and SSTable Creation Adopt It?

Yes, unambiguously — especially for compaction. Here's why:

SSTable Creation (memtable flush)

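When a memtable flushes to an SSTable, the WAL still holds the data, so in theory a crash during flush can be recovered by replaying the WAL. But a partially written SSTable is still left at its final path, and the recovery code will try to open it.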

Compaction

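Compaction is the most critical place for atomic writes: as traced above, its output files are written straight to their final paths, so a crash mid-compaction can leave a partial file where a complete one is expected.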

What the Fix Looks Like


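import os
import tempfile

def atomic_write(final_path, write_fn):
    # Create the temp file in the target directory so the rename stays on one filesystem
    dir_name = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, 'wb') as f:
            write_fn(f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is rename() on POSIX and also overwrites an existing target on Windows
        os.replace(tmp_path, final_path)
    except BaseException:
        # The temp file was never published, so remove it and re-raise
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory to persist the rename
    dir_fd = os.open(dir_name, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)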

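The directory fsync after the rename is often overlooked. Without it, the rename itself (a directory metadata update) may not be persisted, so the new file can disappear after a power loss even though its contents were synced.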
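One consequence of the temp-file approach is that the except branch only runs when Python raises. A hard crash or power loss skips it, so orphaned temp files can be left behind in the data directory. A minimal startup cleanup might look like the sketch below, assuming temp files carry a recognizable suffix (for example by passing suffix=".tmp" to tempfile.mkstemp). The suffix and the function name remove_orphaned_temp_files are illustrative assumptions, not something found in the codebase:

```python
import os

TMP_SUFFIX = ".tmp"  # assumed naming convention, not taken from the codebase


def remove_orphaned_temp_files(directory):
    """Delete leftover temp files in directory; return the names removed."""
    removed = []
    for name in sorted(os.listdir(directory)):
        if name.endswith(TMP_SUFFIX):
            # A temp file that still exists at startup was never renamed into
            # place, so nothing references it and it is safe to drop.
            os.unlink(os.path.join(directory, name))
            removed.append(name)
    return removed
```

Running this before the recovery code scans for SSTables or segments would keep half-written temp files from ever being mistaken for live data.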

The WAL Truncation Case

The truncate() in write-ahead-log/wal.py is the most dangerous omission. It rewrites WAL files to remove already-checkpointed records. This should absolutely use the temp-write-rename pattern, because losing a WAL file during truncation defeats the entire purpose of having a WAL. The current implementation at line 203 opens the file with "wb" which immediately truncates the existing content before writing the filtered records.
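A rough sketch of what a temp-write-rename truncation could look like, assuming the records to keep are available as already-encoded byte strings. The function name rewrite_wal_atomically and its keep_records parameter are hypothetical; the real truncate() has its own record format and filtering logic:

```python
import os
import tempfile


def rewrite_wal_atomically(path, keep_records):
    """Replace the WAL at path with keep_records (already-encoded bytes)."""
    dir_name = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            for record in keep_records:
                f.write(record)
            f.flush()
            os.fsync(f.fileno())
        # Until this call, the old WAL at path is complete and untouched.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory so the rename survives a power loss (POSIX only).
    dir_fd = os.open(dir_name, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

If the rewrite fails partway, the original WAL is still intact at path, which is the property the in-place "wb" rewrite cannot offer.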

Topics to Explore

Beliefs