Date: 2026-05-28
Time: 18:05
I don't have read access to the target repo, but the structure is highly informative. Here's my analysis based on the directory layout, file naming conventions, and deep knowledge of the DDIA concepts involved.
The ~35 modules group naturally into the same layers as the book:
| Module | DDIA Chapter |
|--------|-------------|
| hash-index-storage / log-structured-hash-table | Ch 3 — Bitcask-style hash index |
| sstable-and-compaction | Ch 3 — Sorted String Tables |
| log-structured-merge-tree | Ch 3 — LSM-Tree |
| b-tree-storage-engine | Ch 3 — B-Tree |
| write-ahead-log | Ch 3 — WAL for crash recovery |
| bloom-filter | Ch 3 — Probabilistic membership |
| avro-serializer | Ch 4 — Schema evolution |
| Module | DDIA Chapter |
|--------|-------------|
| leader-follower-replication | Ch 5 — Single-leader |
| multi-leader-replication | Ch 5 — Multi-leader |
| leaderless-replication | Ch 5 — Dynamo-style |
| read-repair | Ch 5 — Anti-entropy |
| hinted-handoff | Ch 5 — Availability during partitions |
| leader-election | Ch 8 — Leader election |
| raft-consensus | Ch 9 — Raft |
| two-phase-commit | Ch 9 — 2PC |
| total-order-broadcast | Ch 9 — Total order broadcast |
| Module | DDIA Chapter |
|--------|-------------|
| consistent-hashing | Ch 6 — Hash partitioning |
| range-partitioning | Ch 6 — Range partitioning |
| secondary-index-partitioning | Ch 6 — Partitioned indexes |
| Module | DDIA Chapter |
|--------|-------------|
| snapshot-isolation | Ch 7 — MVCC |
| write-skew-detection | Ch 7 — SSI |
| fencing-tokens | Ch 8 — Fencing for distributed locks |
| Module | DDIA Chapter |
|--------|-------------|
| lamport-clocks | Ch 8 — Logical clocks |
| vector-clocks | Ch 8 — Vector clocks |
| Module | DDIA Chapter |
|--------|-------------|
| gossip-protocol | Ch 5/8 — Failure detection |
| byzantine-fault-tolerance | Ch 8 — PBFT |
| merkle-tree | Ch 5 — Anti-entropy verification |
| Module | DDIA Chapter |
|--------|-------------|
| mapreduce-framework | Ch 10 — MapReduce |
| batch-word-count | Ch 10 — Batch processing |
| map-side-join | Ch 10 — Join strategies |
| change-data-capture | Ch 11 — CDC |
| event-sourcing-store | Ch 11 — Event sourcing |
| stream-join-processor | Ch 11 — Stream joins |
| partitioned-log | Ch 11 — Log-based messaging |
| Module | DDIA Chapter |
|--------|-------------|
| unbundled-database | Ch 12 — Composing data systems |
| conflict-free-replicated-data-types | Ch 5/12 — CRDTs |
---
The exploration should follow the book's pedagogical arc: storage foundations first, then distribution, then derived data. Within each layer, start with the simpler concept and build toward the complex ones.
write-ahead-log/wal.py — WAL is the foundational crash-recovery primitive; replication, consensus, and event sourcing all depend on log-structured writeshash-index-storage/bitcask.py — Simplest storage engine (Bitcask); establishes the append-only log + in-memory index pattern that LSM and SSTable extendlog-structured-merge-tree/lsm.py — LSM-Tree ties together memtable, SSTable, and compaction; the most architecturally rich storage engine in the collectionb-tree-storage-engine/btree.py — B-Tree is the counterpoint to LSM; understanding both reveals the read-vs-write tradeoff that drives all of Chapter 3leader-follower-replication/replication.py — Single-leader replication is the baseline; establishes replication lag, failover, and log-based sync patternsleaderless-replication/dynamo.py — Dynamo-style quorum reads/writes, sloppy quorums, conflict resolution — the alternative to leader-based modelsraft-consensus/raft.py — Raft is the most complex protocol in the collection; understanding it unlocks leader election, log replication, and safety guaranteestwo-phase-commit/twophasecommit.py — 2PC shows how distributed transactions work (and why they're fragile); contrasts with consensussnapshot-isolation/mvcc_database.py — MVCC is the dominant concurrency control mechanism; snapshot isolation shows how reads and writes coexistvector-clocks/vector_clock.py — Vector clocks capture causality in distributed systems; they're the foundation for conflict detection in leaderless replicationconsistent-hashing/consistent_hashing.py — Consistent hashing solves the partition assignment problem; used by Dynamo, Cassandra, and most distributed storesconflict-free-replicated-data-types/crdts.py — CRDTs represent the state of the art in conflict-free replication; mathematically guaranteed convergencechange-data-capture/cdc.py — CDC bridges OLTP and derived data; it's the integration pattern that ties Part II to Part IIIunbundled-database/unbundled_database.py — The capstone module; composes multiple concepts (log, CDC, indexes) into a unified data system, reflecting Chapter 12's thesistester-test-pattern — Understanding the testertest*.py files that appear in most modules; likely a meta-testing or test validation harness that reveals how correctness is verified across all implementations---
1. WAL then Bitcask — establish the log-append primitive
2. LSM then B-Tree — the two storage engine families
3. Leader-follower then Leaderless then Raft — replication spectrum
4. 2PC then MVCC then Vector clocks — correctness mechanisms
5. Consistent hashing then CRDTs — partitioning and conflict resolution
6. CDC then Unbundled database — integration and capstone
7. Tester pattern — cross-cutting infrastructure understanding
This order mirrors the book's dependency chain: each concept builds on the ones before it, so beliefs extracted early will serve as antecedents for derived beliefs later.