File: merkle-tree/merkle_tree.py

Date: 2026-05-29

Time: 11:58

Purpose

This file implements a Merkle tree — a hash tree used for efficient data integrity verification and anti-entropy synchronization. It's a reference implementation for the concept covered in *Designing Data-Intensive Applications*, where Merkle trees appear in the context of detecting inconsistencies between replicas (e.g., Dynamo-style storage systems use them for anti-entropy repair).

The file owns three responsibilities:

1. Core Merkle tree (MerkleTree) — array-backed binary hash tree with O(log N) updates, inclusion proofs, and tree-diff

2. Key-range overlay (KeyRangeMerkleTree) — adapts the core tree for key-value anti-entropy, the primary DDIA use case

3. Streaming construction (MerkleTreeBuilder) — collects leaves incrementally before building

Key Components

Constants and Helpers

MerkleNode

A dataclass for on-demand tree traversal. Not used internally for storage — the tree is array-backed. Nodes are constructed lazily by buildnode() when callers need an object graph (e.g., the root property). The index field is only meaningful for leaf nodes.

MerkleProof

Carries an inclusion proof: the leaf hash, its index, the sibling hashes along the path to root (each tagged with "left" or "right" to indicate which side the sibling sits on), and the expected root hash. This is everything a verifier needs without access to the full tree.

MerkleTree

The core implementation. Uses a flat array (self._hashes) with implicit parent-child indexing:

Key methods:

| Method | Contract |

|--------|----------|

| _init(datablocks) | Builds tree bottom-up in O(N). Pads to power-of-2 with EMPTY_HASH. |

| updateleaf(index, newdata) | Mutates one leaf and walks up to root recomputing hashes. O(log N). |

| get_proof(index) | Returns a MerkleProof by collecting sibling hashes from leaf to root. |

| verify_proof(data, proof) | Static method. Recomputes the root hash from leaf data + siblings; returns True if it matches. |

| diff(other) | Returns indices of differing leaves between two same-sized trees. Prunes matching subtrees. |

| todict() / fromdict() | Serialization round-trip. Data blocks are hex-encoded. |

KeyRangeMerkleTree

Wraps MerkleTree for the anti-entropy use case. Sorts key-value pairs by key, encodes each as "key:value" bytes, and delegates to the core tree. The diff_keys() method translates leaf indices back to keys — this is what a replica would call to find which keys are out of sync.

MerkleTreeBuilder

Trivial accumulator. Collects leaves via addleaf(), then calls MerkleTree(self.leaves) on build(). Useful when data arrives incrementally.

Patterns

Array-backed binary tree. Instead of pointer-based nodes, the tree stores all hashes in a flat list with implicit index arithmetic. This is cache-friendly and avoids object overhead. The MerkleNode class exists only for external consumers who want a tree-shaped view — it's never used internally.

Power-of-2 padding. The tree always pads to a complete binary tree using EMPTYHASH. This simplifies the index math but means the diff operation must filter out padding indices afterward (the maxidx filter in diff()).

Bottom-up construction. Leaves are hashed first, then internal nodes are computed from leaf_start - 1 down to index 0. This is the standard O(N) Merkle tree build.

Hash chaining. Internal nodes hash the *concatenation of the hex strings* of their children: sha256((lefthex + righthex).encode()). This is a common but specific choice — the hex strings are concatenated as text, then encoded to bytes for hashing.

Dependencies

Imports: Only hashlib and stdlib typing/dataclass — no external dependencies. This is a pure data structure with no I/O.

Imported by: Two test files (testmerkle.py, testmerkle_tree.py), and presumably used by anti-entropy components in the broader DDIA implementations (e.g., could plug into read-repair or leaderless-replication).

Flow

Construction

1. Caller provides data_blocks: List[bytes]

2. nextpowerof2 determines padding size

3. Leaf hashes fill indices [paddedsize - 1, 2*paddedsize - 2], with EMPTY_HASH for padding

4. Internal nodes compute bottom-up from padded_size - 2 down to 0

5. self._hashes[0] is the root hash

Proof generation and verification

1. get_proof(index) walks from leaf to root, collecting the sibling hash and its side ("left"/"right") at each level

2. verify_proof(data, proof) starts from sha256(data), then for each sibling applies: sha256(left + right) with the sibling on the correct side

3. If the recomputed hash equals proof.root_hash, the data is verified

Diff (anti-entropy)

1. Starts at root (index 0)

2. If hashes match, the entire subtree is identical — prune

3. If hashes differ and we're at a leaf, record the index

4. Otherwise recurse into both children

5. Filter out padding indices at the end

This is the key efficiency gain: diffing two trees with N leaves where K differ is O(K log N) instead of O(N).

Invariants

Error Handling

Straightforward — the code validates at boundaries and raises stdlib exceptions:

No errors are swallowed. verify_proof returns a boolean rather than raising on mismatch.

Topics to Explore

Beliefs