Function: writerecord in log-structured-hash-table/bitcask.py

Date: 2026-05-29

Time: 07:27

writerecord — Append a single key-value record to the active segment

Purpose

writerecord is the low-level write primitive for the Bitcask store. Both put and delete serialize data to disk through this method; compaction emits records in the same format but re-serializes them inline instead of calling it. The method encodes a key-value pair into the on-disk binary format (header + payload), appends it to the current active segment file, and returns the byte offset where the record begins. That offset is what the in-memory hash index stores to locate the record later.

Contract

Preconditions:

Postconditions:

Invariant: The on-disk record is self-describing — the header contains enough information (key size, value size) to read the record without external metadata, and the CRC covers the full payload for integrity verification on read.
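To illustrate the self-describing property, here is a minimal reader sketch that needs only a file object and an offset. The name read_record and the exceptions it raises are assumptions made for this example, not the store's actual read path.

```python
import io
import struct
import zlib

HEADER = struct.Struct("!III")  # crc32, key_size, value_size (12 bytes, big-endian)


def read_record(f, offset):
    """Hypothetical reader: return (key, value) for the record at `offset`."""
    f.seek(offset)
    header = f.read(HEADER.size)
    if len(header) != HEADER.size:
        raise EOFError("truncated header")
    crc, key_size, value_size = HEADER.unpack(header)
    # The two size fields are all that is needed to find the payload.
    payload = f.read(key_size + value_size)
    if len(payload) != key_size + value_size:
        raise EOFError("truncated payload")
    if (zlib.crc32(payload) & 0xFFFFFFFF) != crc:
        raise ValueError("CRC mismatch")
    return payload[:key_size].decode("utf-8"), payload[key_size:]


# Build one record by hand and place it at offset 5 of an in-memory "segment".
key_bytes, value = "user:1".encode("utf-8"), b"alice"
payload = key_bytes + value
record = HEADER.pack(zlib.crc32(payload) & 0xFFFFFFFF, len(key_bytes), len(value)) + payload
segment = io.BytesIO(b"\x00" * 5 + record)

result = read_record(segment, 5)
print(result)  # ('user:1', b'alice')
```

No index entry, schema, or sidecar file is consulted: the offset plus the 12-byte header is enough.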

Parameters

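| Parameter | Type | Description |
|-----------|------|-------------|
| key | str | The logical key. Encoded to UTF-8 bytes internally. No maximum length is enforced here, though the 4-byte key_size field limits the encoded key to 2^32 - 1 bytes. |
| value | bytes | The raw value to store. May be actual user data or the TOMBSTONE sentinel. The 4-byte value_size field likewise limits it to 2^32 - 1 bytes. |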

Edge cases: An empty string key ("") produces zero key bytes but is technically valid. An empty value (b"") is also valid: the CRC and size fields then reflect a zero-length value.
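The edge cases can be checked with a small encoder sketch. The helper encode_record is hypothetical and only mirrors the documented header format; it also shows that key_size counts UTF-8 bytes rather than characters.

```python
import struct
import zlib


def encode_record(key: str, value: bytes) -> bytes:
    """Hypothetical helper: serialize one record in the documented layout."""
    key_bytes = key.encode("utf-8")
    payload = key_bytes + value
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return struct.pack("!III", crc, len(key_bytes), len(value)) + payload


empty_both = encode_record("", b"")
empty_value = encode_record("k", b"")
accented = encode_record("é", b"v")

# Even with no key and no value, the 12-byte header is still written.
print(len(empty_both))                               # 12
print(struct.unpack("!III", empty_both))             # (0, 0, 0), since crc32(b"") is 0
print(struct.unpack("!III", empty_value[:12])[1:])   # (1, 0)
print(struct.unpack("!III", accented[:12])[1:])      # (2, 1): "é" is two UTF-8 bytes
```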

Return Value

Returns int — the byte offset within the active segment file where this record's header starts. The caller uses this offset (paired with the file path) to build the index entry for O(1) lookups. The caller is responsible for updating self._index; this method does not touch the index.

Algorithm


1. Encode the key string to UTF-8 bytes.
2. Concatenate key_bytes + value to form the payload.
3. Compute CRC-32 over the payload, masked to 32 bits unsigned.
4. Pack a 12-byte header: [crc32 | key_size | value_size] in network byte order (!III).
5. Capture the current file position (this is the record's offset).
6. Write header + payload as a single contiguous write.
7. Flush the write buffer.
8. Return the captured offset.
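The eight steps above can be sketched as a standalone function. This is an illustration under stated assumptions, not the method's actual source: it takes the file object as a parameter (the real method presumably uses an attribute on self), and an in-memory io.BytesIO stands in for the active segment.

```python
import io
import struct
import zlib


def write_record(f, key: str, value: bytes) -> int:
    key_bytes = key.encode("utf-8")                                  # step 1
    payload = key_bytes + value                                      # step 2
    crc = zlib.crc32(payload) & 0xFFFFFFFF                           # step 3
    header = struct.pack("!III", crc, len(key_bytes), len(value))    # step 4
    offset = f.tell()                                                # step 5
    f.write(header + payload)                                        # step 6
    f.flush()                                                        # step 7
    return offset                                                    # step 8


segment = io.BytesIO()  # stand-in for the active segment file
first = write_record(segment, "a", b"1")
second = write_record(segment, "bb", b"22")
print(first, second, len(segment.getvalue()))  # 0 14 30
```

Note that flush() only hands buffered bytes to the operating system. Whether the real method also forces them to stable storage (for example with os.fsync) is not stated here.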

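The CRC covers only the payload (key bytes + value), not the header itself, so it cannot detect a corrupted header directly. In most cases a wrong size field makes the reader extract the wrong payload slice, which then fails the CRC check indirectly. The exception is corruption that leaves key_size + value_size unchanged: the reader extracts the same payload bytes, the CRC still matches, and the key/value boundary is silently misplaced.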
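The sketch below exercises header corruption directly. It assumes a reader that slices key_size + value_size bytes and checks the CRC over that slice; the helpers encode, check, and with_sizes are hypothetical. It also covers one case worth knowing about: if the two size fields change but their sum does not, the payload slice is identical and the CRC cannot notice.

```python
import struct
import zlib

HEADER = struct.Struct("!III")  # crc32, key_size, value_size


def encode(key_bytes: bytes, value: bytes) -> bytes:
    payload = key_bytes + value
    return HEADER.pack(zlib.crc32(payload) & 0xFFFFFFFF, len(key_bytes), len(value)) + payload


def check(buf: bytes):
    """Return (crc_ok, key_bytes, value) for the record at the start of buf."""
    crc, key_size, value_size = HEADER.unpack_from(buf, 0)
    payload = buf[HEADER.size:HEADER.size + key_size + value_size]
    crc_ok = (zlib.crc32(payload) & 0xFFFFFFFF) == crc
    return crc_ok, payload[:key_size], payload[key_size:]


def with_sizes(buf: bytes, key_size: int, value_size: int) -> bytes:
    """Overwrite both size fields while leaving the stored CRC untouched."""
    crc = HEADER.unpack_from(buf, 0)[0]
    return HEADER.pack(crc, key_size, value_size) + buf[HEADER.size:]


segment = encode(b"key", b"value") + encode(b"next", b"record")

intact = check(segment)                      # sizes 3 and 5, as written
shrunk = check(with_sizes(segment, 3, 4))    # total length changes: wrong slice
shifted = check(with_sizes(segment, 4, 4))   # total length preserved: same slice

print(intact)   # (True, b'key', b'value')
print(shifted)  # (True, b'keyv', b'alue')
```

In the shrunk case the reader checks a different slice, so the CRC comparison is expected to fail. Extending the CRC to cover the size fields would close the remaining gap, at the cost of changing the on-disk format.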

Side Effects

Error Handling

This method does not catch any exceptions. Possible failures include:

All of these propagate to the caller unhandled.
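Three failure modes that follow from the algorithm as described can be reproduced with the standard library alone. This is an assumption about which errors matter in practice, and the FullDisk class is a hypothetical stand-in for a segment file on a full disk.

```python
import errno
import os
import struct
import zlib


def write_record(f, key, value):
    # Same steps as the documented algorithm, with no try/except anywhere.
    key_bytes = key.encode("utf-8")
    payload = key_bytes + value
    header = struct.pack("!III", zlib.crc32(payload) & 0xFFFFFFFF, len(key_bytes), len(value))
    offset = f.tell()
    f.write(header + payload)
    f.flush()
    return offset


class FullDisk:
    """Hypothetical file object whose write always fails with ENOSPC."""

    def tell(self):
        return 0

    def write(self, data):
        raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))

    def flush(self):
        pass


def failure(key, value):
    """Return the exception type that escapes write_record, or None."""
    try:
        write_record(FullDisk(), key, value)
    except Exception as exc:
        return type(exc)
    return None


bad_key = failure("\ud800", b"v")   # a lone surrogate is not encodable as UTF-8
disk_full = failure("k", b"v")      # the write itself fails

# A size that does not fit an unsigned 32-bit field cannot be packed.
try:
    struct.pack("!III", 0, 0, 2**32)
    oversize = None
except struct.error as exc:
    oversize = type(exc)

print(bad_key.__name__, disk_full.__name__, oversize is struct.error)
```

Because nothing is caught inside the write path, a caller that wants to keep the index consistent should update it only after the write returns.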

Usage Patterns

Three code paths produce records in this format, but only the first two call writerecord:

1. put(key, value): writes user data, then updates the index with the returned offset.

2. delete(key): writes the key with TOMBSTONE as the value, then removes the key from the index.

3. compact(): rewrites live records to a new segment, but bypasses writerecord and re-serializes records inline in the compaction loop.

The leading underscore signals this is an internal method. Callers must handle segment rotation *before* calling writerecord — the method itself has no size-checking logic.
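A rough sketch of how a caller might rotate first and then write, under several assumptions: the class MiniStore, the helper rotate_if_needed, the size limit, the segment naming scheme, and the TOMBSTONE value are all invented for illustration and are not taken from bitcask.py.

```python
import os
import struct
import tempfile
import zlib

TOMBSTONE = b"__tombstone__"   # hypothetical sentinel value
MAX_SEGMENT_BYTES = 100        # hypothetical limit, tiny so rotation is visible
HEADER_SIZE = 12


class MiniStore:
    def __init__(self, directory):
        self._dir = directory
        self._index = {}        # key -> (segment path, offset)
        self._segment_id = 0
        self._open_segment()

    def _open_segment(self):
        self._path = os.path.join(self._dir, f"{self._segment_id:06d}.seg")
        self._file = open(self._path, "ab")

    def rotate_if_needed(self, incoming):
        # The size check lives in the caller, not in the write primitive.
        if self._file.tell() + incoming > MAX_SEGMENT_BYTES:
            self._file.close()
            self._segment_id += 1
            self._open_segment()

    def writerecord(self, key, value):
        key_bytes = key.encode("utf-8")
        payload = key_bytes + value
        header = struct.pack("!III", zlib.crc32(payload) & 0xFFFFFFFF, len(key_bytes), len(value))
        offset = self._file.tell()
        self._file.write(header + payload)
        self._file.flush()
        return offset

    def put(self, key, value):
        self.rotate_if_needed(HEADER_SIZE + len(key.encode("utf-8")) + len(value))
        offset = self.writerecord(key, value)
        self._index[key] = (self._path, offset)

    def delete(self, key):
        self.rotate_if_needed(HEADER_SIZE + len(key.encode("utf-8")) + len(TOMBSTONE))
        self.writerecord(key, TOMBSTONE)
        self._index.pop(key, None)

    def close(self):
        self._file.close()


def summary(store):
    return {k: (os.path.basename(p), off) for k, (p, off) in store._index.items()}


with tempfile.TemporaryDirectory() as d:
    store = MiniStore(d)
    for k in ("a", "b", "c"):
        store.put(k, b"x" * 30)          # each record is 43 bytes
    after_puts = summary(store)
    store.delete("a")
    after_delete = summary(store)
    segments = sorted(os.listdir(d))
    store.close()

print(after_puts)    # "c" lands at offset 0 of the second segment
print(after_delete)  # "a" is gone from the index, its tombstone is on disk
```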

Dependencies

On-Disk Record Layout


┌──────────────────────── HEADER (12 bytes) ─────────────────────────┐
│  CRC32 (4B)  │  key_size (4B)  │  value_size (4B)                  │
├──────────────────────── PAYLOAD (variable) ────────────────────────┤
│  key_bytes (key_size B)  │  value_bytes (value_size B)             │
└────────────────────────────────────────────────────────────────────┘

This is the exact format that scansegment and get expect when reading back.
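As an illustration of reading this layout back sequentially, the sketch below walks an in-memory segment and rebuilds an index from it. The function name scan_segment, its policy of stopping at the first truncated or corrupt record, and the TOMBSTONE value are assumptions for this example, not a description of the store's real scansegment.

```python
import io
import struct
import zlib

HEADER = struct.Struct("!III")  # crc32, key_size, value_size
TOMBSTONE = b"__tombstone__"    # hypothetical sentinel value


def encode(key: str, value: bytes) -> bytes:
    key_bytes = key.encode("utf-8")
    payload = key_bytes + value
    return HEADER.pack(zlib.crc32(payload) & 0xFFFFFFFF, len(key_bytes), len(value)) + payload


def scan_segment(f):
    """Yield (offset, key, value) per intact record; stop at the first bad one."""
    offset = 0
    f.seek(0)
    while True:
        header = f.read(HEADER.size)
        if len(header) < HEADER.size:
            return  # clean end of file or a partially written header
        crc, key_size, value_size = HEADER.unpack(header)
        payload = f.read(key_size + value_size)
        if len(payload) < key_size + value_size:
            return  # partially written payload
        if (zlib.crc32(payload) & 0xFFFFFFFF) != crc:
            return  # corrupt record
        yield offset, payload[:key_size].decode("utf-8"), payload[key_size:]
        offset += HEADER.size + len(payload)


# Three records followed by two stray bytes that simulate an interrupted write.
segment = io.BytesIO(
    encode("a", b"1") + encode("b", b"2") + encode("a", TOMBSTONE) + b"\x00\x01"
)

index = {}
for offset, key, value in scan_segment(segment):
    if value == TOMBSTONE:
        index.pop(key, None)   # a later tombstone hides earlier values
    else:
        index[key] = offset

print(index)  # {'b': 14}
```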