Topic: How this implementation compares to the original Bitcask paper (missing CRC, merge triggers, expiry)

Date: 2026-05-29

Time: 12:00

How This Implementation Compares to the Original Bitcask Paper

The repository contains two separate Bitcask implementations that take different stances on fidelity to the original Bitcask paper. The divergences cluster around three areas: data integrity (CRC), merge/compaction triggers, and key expiry.

2. Merge / Compaction Triggers

The paper describes a merge process that runs when the number of immutable (non-active) data files crosses a threshold, and it produces new merged files plus hint files for fast restart.

| Implementation | Trigger | Behavior |

|---|---|---|

| hash-index-storage/bitcask.py | Manual only | compact() is a public method (line 198). Nothing in put or mayberotate (line 105) triggers it automatically. The caller must decide when to compact. |

| log-structured-hash-table/bitcask.py | Automatic threshold | The constructor takes autocompactthreshold (default 5, line 39). Inside put (lines 157–160), after every write, it counts frozen segments and calls self.compact() when the count reaches the threshold. This is close to the paper's model. |

The hash-index-storage version's manual-only compaction means that in a long-running process, dead data accumulates indefinitely unless the application explicitly calls compact(). The paper considers automatic triggering essential because unbounded file growth degrades startup time (more files to scan) and wastes disk.

---

3. Key Expiry / TTL

The paper specifies an optional per-key expiry timestamp. During merge, expired keys are dropped just like tombstones. Neither implementation supports this:

This means neither implementation can serve use cases like session stores or caches where keys should auto-evict. To add expiry, you'd need to:

1. Add a TTL field to the record header.

2. Check it in get (return None + optionally write a tombstone).

3. Skip expired keys during merge/compaction.

---

4. Other Notable Divergences

Tombstone encoding. The paper uses a dedicated tombstone marker. hash-index-storage/bitcask.py uses an empty-string value ("") as the tombstone sentinel (line 84: self.writerecord(key, "")), which means empty strings can't be stored as legitimate values — get returns None for "" at line 176. The log-structured-hash-table version uses a proper sentinel byte sequence TOMBSTONE = b"_BITCASKTOMBSTONE__" (line 9), which is more robust but still technically collides if someone stores that exact byte pattern.

Value types. hash-index-storage stores string values; log-structured-hash-table stores raw bytes. The paper is byte-oriented, so the latter is closer.

---

Summary Table

| Feature | Paper | hash-index-storage | log-structured-hash-table |

|---|---|---|---|

| CRC per record | Required | Missing | Present |

| Hint files | Required | Present | Present |

| Auto merge trigger | Described | Missing (manual only) | Present (threshold) |

| Key expiry / TTL | Optional | Missing | Missing |

| Tombstone encoding | Dedicated marker | Empty string (leaky) | Sentinel bytes |

| fsync on write | Recommended | Configurable | Not called |

---

Topics to Explore

Beliefs