Original source: topic-ddia-chapter-6-partitioning

Topic: DDIA Chapter 6 covers the theory behind partition strategies, rebalancing, and request routing that this code implements

Date: 2026-05-29

Time: 14:00

DDIA Chapter 6: Partitioning — From Theory to Code

Chapter 6 of *Designing Data-Intensive Applications* lays out three core problems: how to partition data, how to rebalance when nodes change, and how to route requests to the right partition. This codebase implements all three, each in a separate module that isolates one strategy.

Rebalancing

Dynamic Splitting and Merging

Range partitioning uses dynamic rebalancing via split and merge — the strategy Kleppmann associates with HBase and RethinkDB. When a partition exceeds maxpartitionsize, put automatically splits it at the median key (lines 118–121 of rangepartitioning.py). The Partition.split method (lines 69–79) creates a new right partition and adjusts boundaries. Conversely, mergesmall_partitions (lines 152–166) recombines adjacent undersized partitions.

Consistent Hashing Rebalancing

When a node is added or removed from the hash ring, addnode (lines 27–46) and removenode (lines 48–66) return transfer maps: {(arcstart, arcend): (fromnode, tonode)}. This models what Kleppmann describes as the key advantage of consistent hashing — only keys in the affected arcs move, not the entire dataset. The weight parameter (line 27) supports heterogeneous nodes by scaling virtual node count.

Consumer Group Rebalancing

partitioned-log/partitioned_log.py implements Kafka-style consumer group rebalancing (line 317: rebalance). When consumers join or leave a group (lines 305, 309), partitions are redistributed. The Producer class (lines 144–175) handles the routing side: keyed messages hash to a fixed partition (line 161), while unkeyed messages round-robin (lines 163–165) — exactly the two strategies Kleppmann describes for Kafka producers.

---

Request Routing

The code demonstrates what Kleppmann calls the partition-aware client approach. There's no separate routing tier; instead, the store itself maintains routing metadata:

In RangePartitionedStore, the boundaries list acts as a partition map. findpartitionindex (lines 110–112) is the routing function — a binary search that resolves any key to its owning partition in O(log n).
In ConsistentHashRing, get_node (lines 78–84) is the routing function — walk clockwise from the key's hash position to find the owning node.
In DocumentPartitionedDB, primary key lookups route to exactly one partition via partitionfor (line 57: hash(pk) % num_partitions), while secondary index queries must fan out.

The getnodes method in consistent hashing (lines 86–101) extends routing to support replication: it walks the ring collecting replicationfactor distinct physical nodes, implementing the preference list concept from Dynamo that Kleppmann describes.

---

Topics to Explore

[file] secondary-index-partitioning/secondaryindexpartitioning.py — Read the full TermPartitionedDB implementation (lines 200+) to see how writes fan out to multiple index partitions and how the async index update queue works
[function] partitioned-log/partitioned_log.py:rebalance — The consumer group rebalancing algorithm at line 317; compare with Kafka's cooperative sticky assignor
[function] consistent-hashing/consistenthashing.py:getload_distribution — Measures actual arc ownership per node (lines 117–131); useful for understanding why 150 vnodes is the default and how weight affects balance
[file] hinted-handoff/hinted_handoff.py — The coordinator at line 72 routes requests and handles node failures, connecting partitioning to fault tolerance (Chapter 5 meets Chapter 6)
[general] scatter-gather-cost — Compare DocumentPartitionedDB.querybyfield (touches all partitions) vs TermPartitionedDB query (touches one) with concrete metrics from get_stats()

Beliefs

range-partition-split-at-median — Partition.split always divides at the median key (index len // 2), which can produce uneven partitions when key distribution is skewed
consistent-hash-ring-uses-md5 — All ring positions are computed via MD5 truncated to 32 bits; the hash function is not pluggable
document-partitioned-query-touches-all — DocumentPartitionedDB.querybyfield always scans every partition regardless of selectivity, and records num_partitions as partitions touched
term-partitioned-write-fans-out — A single TermPartitionedDB.put may update global index entries on multiple partitions (one per indexed field value), making writes more expensive than in document-partitioned mode
range-partition-routing-is-log-n — RangePartitionedStore.findpartitionindex uses bisect.bisectright on the sorted boundary list, giving O(log n) key-to-partition resolution