Function: get in log-structured-hash-table/bitcask.py

Date: 2026-05-29

Time: 08:37

BitcaskStore.get — Point lookup by key

Purpose

get is the read path of the Bitcask storage engine. It retrieves a value for a given key by consulting the in-memory hash index to find the record's physical location (segment file + byte offset), then performing a single disk seek to read and validate the record. This is the core design principle of Bitcask: O(1) index lookup in memory, one disk read per query.

Contract

Parameters

| Parameter | Type | Description |

|-----------|------|-------------|

| key | str | The lookup key. Must match exactly (no prefix/range support). No length validation — any string the caller passes will be checked against the index. |

Return Value

Algorithm

1. Index lookup — Check self._index for key. If absent, short-circuit with None. This is a dict lookup: O(1) average case.

2. Locate on disk — Unpack (segpath, offset) from the index entry. segpath is the absolute path to the segment file; offset is the byte position where this record's header begins.

3. Open a fresh file handle — Opens segpath in binary-read mode via a with block. This is deliberate: the class also maintains a file_handles cache elsewhere, but get avoids it. The comment says "to avoid position conflicts" — because a cached handle's file position would be shared across calls, causing reads to land at wrong offsets under interleaved access.

4. Read the header — Seeks to offset, reads HEADERSIZE (12 bytes: three unsigned 32-bit ints in network byte order). Unpacks into (crc, keysize, value_size).

5. Read the payload — Reads keysize + valuesize bytes immediately following the header. The payload is keybytes || valuebytes concatenated.

6. CRC validation — Computes zlib.crc32 over the payload and masks to 32 bits. Compares against the stored crc. This detects bit-rot, partial writes, and file corruption.

7. Extract value — Returns payload[key_size:], slicing off the key prefix to yield only the value bytes.

Side Effects

Minimal. Opens and closes a file descriptor on each call (the with block). No mutations to self._index, no writes to disk, no network I/O. The file handle is not cached — each get pays the open/close cost but avoids stale-position bugs.

Error Handling

| Condition | Behavior |

|-----------|----------|

| Key not in index | Returns None (normal miss) |

| Header read returns fewer than 12 bytes | Raises CorruptionError |

| Payload read returns fewer than expected bytes | Raises CorruptionError |

| CRC mismatch | Raises CorruptionError with expected vs actual CRC |

| Segment file missing (deleted or moved) | Unhandled FileNotFoundError propagates to caller |

| Key was deleted | Returns Nonedelete() removes the index entry, so it never reaches disk |

Note that get does not check whether the value is a tombstone (TOMBSTONE). It doesn't need to: delete() removes the key from the index, so a tombstone record is never reachable through get. The tombstone sentinel only matters during recovery (scansegment) and compaction.

Usage Patterns


store = BitcaskStore("/tmp/mydb")
store.put("user:1", b'{"name": "Alice"}')
value = store.get("user:1")    # b'{"name": "Alice"}'
value = store.get("missing")   # None

Callers should:

Dependencies

Assumptions not enforced by the type system

1. The segment file still exists at seg_path — compaction deletes old segments and updates the index, but if the process crashes between deletion and index update, get will raise FileNotFoundError.

2. The store hasn't been close()d — no guard prevents calling get after close().

3. offset is a valid position — the index is trusted implicitly; a corrupted index would cause reads at arbitrary offsets.

4. Keys are valid UTF-8get never decodes the key (it only slices past it), but the write path encodes with UTF-8, so the stored key length in bytes may differ from len(key) in characters.

5. Values are never equal to TOMBSTONE — if a caller stores the literal bytes b"_BITCASKTOMBSTONE__" as a value, recovery will treat it as a deletion.