Function: diff in merkle-tree/merkle_tree.py

Date: 2026-05-29

Time: 11:48

MerkleTree.diff

Purpose

diff identifies which leaf positions differ between two Merkle trees. This is the core operation that makes Merkle trees useful for anti-entropy in distributed systems: two replicas each build a tree over their data, then diff the roots. If roots match, all data is identical — no further communication needed. If they don't, the diff drills down to find exactly which leaves diverged, transferring only O(log N) hashes instead of comparing every record.

Contract

Preconditions:

Postconditions:

Invariant: If self.roothash == other.roothash, the result is always an empty list. The recursive helper short-circuits at the root and never descends.

Parameters

| Parameter | Type | Description |

|-----------|------|-------------|

| other | MerkleTree | The tree to compare against. Must have been constructed with the same number of padded slots (same power-of-2 rounding). |

Note: two trees can have different leafcount values and still be diffable, as long as their paddedsize matches. For example, a tree with 5 leaves and one with 7 leaves both pad to 8, so they're comparable. The extra slots in the 5-leaf tree will have EMPTY_HASH, and those positions will show up as diffs if the 7-leaf tree has data there.

Return Value

List[int] — indices into the original data blocks where the two trees disagree. Sorted in ascending order (as a consequence of the left-to-right DFS in diffrecursive). The caller is responsible for interpreting what the indices mean — diff doesn't return the actual data, just the positions.

Algorithm

1. Guard: Reject trees with different paddedsize — their array layouts are incompatible.

2. Recursive descent via diffrecursive(other, 0, result), starting at the root (index 0):

3. Filter padding: The recursion may report diffs in padding positions (indices ≥ both leafcount values). The final list comprehension strips these out using max(self.leafcount, other.leafcount) as the upper bound. It uses max, not min — if tree A has 5 leaves and tree B has 7, positions 5 and 6 are real diffs (B has data that A doesn't), so they should be reported.

Side Effects

None. The method is purely read-only — it doesn't mutate either tree. The result list is local to the call.

Error Handling

Usage Patterns

Typical use is through KeyRangeMerkleTree.diff_keys, which translates leaf indices back to key names:


tree_a = KeyRangeMerkleTree(replica_a_pairs)
tree_b = KeyRangeMerkleTree(replica_b_pairs)
stale_keys = tree_a.diff_keys(tree_b)  # internally calls self._tree.diff(other._tree)

Caller obligation: the two trees must represent the same key-space partitioned identically. If replicas use different sort orders or different key sets, the positional indices are meaningless. KeyRangeMerkleTree handles this by sorting keys before construction.

Dependencies

Untyped Assumptions

Beliefs