Function: get in log-structured-merge-tree/lsm.py

Date: 2026-05-29

Time: 14:21

SSTable.get — Point Lookup via Sparse Index

Purpose

This method performs a point lookup for a single key within a sorted string table (SSTable). SSTables store key-value pairs in sorted order on disk, and this method uses a sparse index — a sampled subset of keys with their file offsets — to narrow the search region before doing a linear scan. It exists because reading the entire SSTable sequentially for every lookup would be prohibitively slow; the sparse index turns an O(n) scan into an O(n/k) scan where k is the index interval (default 16).

Contract

Preconditions:

Postconditions:

Invariant: The sparse index entries are in sorted key order, matching the on-disk sort order.

Parameters

| Parameter | Type | Description |

|-----------|------|-------------|

| key | str | The key to search for. Must match the encoding used at write time (UTF-8). |

Return Value

Tuple[bool, Optional[bytes]] — A two-element tuple:

Algorithm

1. Empty index guard — If sparseindex is empty (no data or load_index() never called), return miss immediately.

2. Extract index keys — Build a list of just the keys from the sparse index (dropping offsets). This is O(m) where m is the number of index entries (~total_entries / 16).

3. Binary search — Use bisect.bisectright(keys, key) - 1 to find the rightmost index entry whose key is ≤ the target. bisectright returns the insertion point *after* any equal keys, so subtracting 1 gives the last entry that could be a prefix of the range containing our key.

4. Clamp to zero — If idx < 0 (the target key is smaller than every indexed key), clamp to 0. This is correct: the target might still exist between the start of file and the first indexed key.

5. Compute scan range — The scan region runs from sparseindex[idx][1] (the file offset of the index entry) to either the next index entry's offset or the footer start (if this is the last index entry). This bounds the linear scan to at most sparseindexinterval entries.

6. Linear scan — Delegate to scanrangeforkey, which reads entries sequentially within [startoff, endoff), returning on exact match or early-exiting when it passes the target key (since entries are sorted).

Side Effects

Error Handling

No explicit error handling. The method will raise:

None of these are caught — they propagate to the caller.

Usage Patterns

Called by LSMTree.get, which searches SSTables from newest to oldest:


for sst in reversed(self._sstables):
    found, v = sst.get(key)
    if found:
        return None if v == TOMBSTONE else v.decode("utf-8")

The caller is responsible for:

1. Interpreting TOMBSTONE as a deletion (not returning b"" as a valid value).

2. Searching SSTables in reverse sequence order so newer writes shadow older ones.

3. Ensuring load_index() was called after opening an existing SSTable from disk.

Dependencies

Notable Assumptions

1. Keys in sparseindex are sorted. The binary search is only correct if this holds. It's guaranteed by construction (write iterates a pre-sorted list, load_index reads them in file order), but there's no runtime assertion.

2. The scan range [startoff, endoff) contains all keys between two consecutive index keys. This relies on the data being written in sorted order with no gaps — valid for a properly constructed SSTable, but not validated at read time.

3. Extracting keys into a list every call (keys = [k for k, in self.sparse_index]) is O(m) allocation on every lookup. For a hot path, this could be cached. The current implementation trades memory for simplicity.

Beliefs