Function: readrecord in hash-index-storage/bitcask.py

Date: 2026-05-29

Time: 11:58

readrecord — Point-read a single record from a Bitcask data file

Purpose

readrecord is the read-path counterpart to writerecord. Given a file ID, byte offset, and record size, it deserializes a single key-value record from disk. This is the only way the store retrieves values — the in-memory keydir hash index stores the location coordinates, and readrecord does the actual I/O to fetch the data.

It exists because Bitcask separates the index (in-memory hash map of key → location) from the data (append-only files on disk). Reads are always single-seek operations: look up the location in keydir, then call this method to fetch the bytes.

Contract

Preconditions:

Postconditions:

Invariant: The returned key and value are the exact strings that were passed to writerecord when the record was created.

Parameters

| Parameter | Type | Meaning |

|-----------|------|---------|

| fileid | int | Identifies which .data file to read from (maps to {fileid}.data on disk) |

| offset | int | Byte offset within that file where the record starts |

| size | int | Total byte length of the record (header + key bytes + value bytes) |

All three values come from a KeyEntry in self.keydir. The caller never computes these — they were captured at write time by writerecord.

Return Value

Returns (key, value, timestamp) where:

The caller (get()) is responsible for checking whether value == "" indicates a tombstone/deletion.

Algorithm

1. Obtain a read handle via getreader(file_id), which lazily opens the file in "rb" mode and caches the handle.

2. Seek to the exact byte offset where the record begins.

3. Bulk-read the entire record (size bytes) into memory in a single I/O call — this is important because it avoids multiple small reads.

4. Unpack the header — the first 16 bytes (HEADERSIZE) are parsed as (timestamp, keysize, val_size) using the little-endian format <dII.

5. Slice and decode the key — bytes [16 : 16 + key_size], decoded as UTF-8.

6. Slice and decode the value — bytes [16 + keysize : 16 + keysize + val_size], decoded as UTF-8.

7. Return the triple.

The key insight: size is used only for the bulk read. The actual field boundaries come from keysize and valsize unpacked from the header. The size parameter is redundant with HEADERSIZE + keysize + val_size — it's passed for convenience so the method can do a single read() call without first reading just the header.

Side Effects

Error Handling

There is essentially none — the method assumes valid inputs:

Usage Patterns

Called exclusively by get():


def get(self, key):
    entry = self.keydir.get(key)
    if entry is None:
        return None
    read_key, value, _ = self._read_record(entry.file_id, entry.offset, entry.size)
    assert read_key == key  # sanity check

The assert in get() acts as a weak integrity check — if the keydir is out of sync with the data files, the key won't match. But this is a development-time guard, not a production safety net (assertions can be disabled with -O).

Also used implicitly during compaction, where old records are re-read via direct file I/O rather than through this method (compaction opens its own temporary readers).

Dependencies

Topics to Explore

Beliefs