Topic: The SSTable format in lsm.py stores its sparse index as a footer; explore what happens if the footer's 12-byte trailer is corrupted vs. a mid-file data entry

Date: 2026-05-29

Time: 10:06

SSTable Footer Corruption vs. Mid-File Data Corruption

The file layout

The SSTable.write() method (lines 83–103 of lsm.py) produces a file with three regions:


[ data entries... | sparse index entries... | footer_start (8B) | count (4B) ]
                  ^                          ^--- trailer (12 bytes) ---------^
                  |
                  footer_start points here

The last 12 bytes — the trailer — are the anchor for the entire file. Every read operation begins by seeking to those 12 bytes to find the sparse index, which in turn locates data entries.

Scenario 2: Corrupted mid-file data entry

readentry() at lines 181–196 reads each entry as [4B klen][key][4B vlen][value]. Corruption here has a narrower blast radius but a subtle secondary effect:

If the key-length field is corrupted:

If the value bytes are corrupted but lengths are intact:

The sparse index limits the damage. Because get() at lines 119–130 only scans between two adjacent sparse index offsets (startoff to endoff), a corrupted entry only affects lookups for keys in that one segment. Keys indexed by other sparse index entries are found via clean segments — they're unaffected.

---

The asymmetry

| | Corrupted trailer | Corrupted data entry |

|---|---|---|

| Blast radius | Entire SSTable | One sparse-index segment (up to sparseindexinterval entries, default 16) |

| Detection | Usually crashes (struct.error) | Often silent (wrong data, skipped entries) |

| Recovery possible? | No — the anchor is gone | Partial — other segments still readable |

| Defense in real systems | Magic numbers, checksums on footer | Per-block CRCs (e.g., LevelDB uses CRC32 on every block) |

The code has zero integrity checking — no checksums, no magic bytes, no redundant metadata. This is typical of a teaching implementation, but it makes the trailer corruption especially catastrophic since that single 12-byte anchor is an unprotected single point of failure for the entire file.

---

Topics to Explore

Beliefs