Whether json.dumps(sort_keys=True) is truly canonical (it isn't for floats, nested dicts with mixed types, or non-ASCII) and when this matters
Date: 2026-05-29
Time: 11:52
Is json.dumps(sort_keys=True) Truly Canonical?
Short answer: no, but it's good enough for this codebase, if only barely.
The only place sort_keys=True is used for hashing is byzantine-fault-tolerance/pbft.py:46:
```python
return hashlib.sha256(json.dumps(request, sort_keys=True, default=str).encode()).hexdigest()
```
This creates a digest that all PBFT replicas must agree on. If two honest nodes compute different hashes for the same logical request, consensus breaks. That makes this line load-bearing for correctness.
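Where sort_keys=True Falls Short
Floats are only as stable as the arithmetic behind them. json.dumps(0.1 + 0.2) produces 0.30000000000000004, while json.dumps(0.3) produces 0.3. CPython itself is deterministic here: since Python 3.1, json.dumps emits the shortest repr that round-trips, so the same IEEE 754 double yields the same text on practically every platform. The risk lies upstream and across languages. Two nodes that arrive at "the same" value through different arithmetic hold different doubles, 1 and 1.0 serialize differently, and other JSON encoders format floats differently (JavaScript writes 1.0 as 1 and 1e16 as 10000000000000000, where Python writes 1.0 and 1e+16). If a request payload carries floating-point results, replicas can hash the same logical value differently.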
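A small sketch of how float payloads can split digests even on a single Python build. The `digest` helper copies the recipe from the pbft.py line quoted above; the sample payloads are hypothetical and not taken from the codebase.

```python
import hashlib
import json


def digest(request):
    # Same recipe as the pbft.py line quoted earlier in this note.
    return hashlib.sha256(
        json.dumps(request, sort_keys=True, default=str).encode()
    ).hexdigest()


# Hypothetical payloads: "equal" amounts reached by different arithmetic.
computed = {"amount": 0.1 + 0.2}
literal = {"amount": 0.3}

# An integer-valued amount sent as int by one client and as float by another.
as_int = {"amount": 1}
as_float = {"amount": 1.0}

print(json.dumps(computed))                  # {"amount": 0.30000000000000004}
print(json.dumps(literal))                   # {"amount": 0.3}
print(digest(computed) == digest(literal))   # False
print(digest(as_int) == digest(as_float))    # False, although 1 == 1.0 in Python
```

Neither mismatch involves a bug in the encoder; the inputs are simply different values that a human would call the same.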
default=str is a canonicalization escape hatch. Any object that is not JSON-serializable gets converted via str(). This is dangerous: str(datetime.now()) includes microseconds, str(some_object) may include a memory address, and str() on a custom class falls back to whatever __repr__ returns when no __str__ is defined. Two nodes receiving the same request object could serialize it differently if any field falls through to default=str.
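To make the default=str hazard concrete, here is a minimal sketch using only the standard library. The `Token` class and the timestamps are hypothetical stand-ins for whatever non-serializable values a request might carry.

```python
import json
from datetime import datetime, timedelta, timezone


class Token:
    """Hypothetical payload type with no __str__ or __repr__ of its own."""


# The same instant expressed in two time zones: equal values, different text.
utc = datetime(2026, 5, 29, 11, 52, tzinfo=timezone.utc)
cest = utc.astimezone(timezone(timedelta(hours=2)))

print(utc == cest)                            # True
print(json.dumps({"ts": utc}, default=str))   # {"ts": "2026-05-29 11:52:00+00:00"}
print(json.dumps({"ts": cest}, default=str))  # {"ts": "2026-05-29 13:52:00+02:00"}

# The inherited object.__repr__ embeds the instance's memory address,
# so this output changes from process to process.
print(json.dumps({"tok": Token()}, default=str))
```

Dropping default=str makes json.dumps raise TypeError for such values, which turns a silent divergence into a visible failure.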
Non-ASCII strings have multiple valid JSON encodings. By default Python's json.dumps uses ensure_ascii=True (escaping non-ASCII characters to \uXXXX), which is consistent within CPython. But if that default ever changes or someone passes ensure_ascii=False, the same Unicode string can be encoded as either the literal character or the escape sequence: both valid JSON, different bytes, different hashes.
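The two encodings can be compared directly. This is an illustrative sketch; the `name` payload is hypothetical.

```python
import hashlib
import json

name = {"user": "Jos\u00e9"}  # "José" with a precomposed e-acute (U+00E9)

escaped = json.dumps(name, sort_keys=True)  # default ensure_ascii=True
literal = json.dumps(name, sort_keys=True, ensure_ascii=False)

print(escaped)  # {"user": "Jos\u00e9"}
print(literal == '{"user": "Jos\u00e9"}')  # True: the character is written as-is

# Both decode to the same Python object...
print(json.loads(escaped) == json.loads(literal))  # True

# ...but the bytes, and therefore the digests, differ.
h_escaped = hashlib.sha256(escaped.encode("utf-8")).hexdigest()
h_literal = hashlib.sha256(literal.encode("utf-8")).hexdigest()
print(h_escaped == h_literal)  # False
```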
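Key ordering is only skin-deep. sort_keys=True sorts dict keys at every nesting level, but it does so by comparing the original Python keys before they are converted to JSON strings. For string keys that means Unicode code point order, which is deterministic and locale-independent but differs from the UTF-16 code unit order that RFC 8785 specifies once keys contain characters outside the Basic Multilingual Plane. For non-string keys it is worse: integer keys sort numerically rather than as strings, and a dict that mixes key types (say, int and str) raises TypeError instead of serializing.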
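Three key-ordering quirks of CPython's json module can be checked directly. The dicts below are hypothetical examples, and the UTF-16 sort key is only a stand-in for the ordering rule RFC 8785 describes, not an implementation of that scheme.

```python
import json

# 1. Keys are sorted before non-string keys are coerced to strings, so int
#    keys sort numerically while the equivalent string keys sort as text.
int_keys = {2: "a", 10: "b"}
str_keys = {"2": "a", "10": "b"}
print(json.dumps(int_keys, sort_keys=True))  # {"2": "a", "10": "b"}
print(json.dumps(str_keys, sort_keys=True))  # {"10": "b", "2": "a"}

# 2. Mixed key types cannot be ordered at all.
try:
    json.dumps({1: "a", "b": 2}, sort_keys=True)
except TypeError as exc:
    print("TypeError:", exc)

# 3. Python orders strings by code point; UTF-16 code unit order puts
#    surrogate pairs (U+10000 and above) before U+E000..U+FFFF.
keys = ["\ufb01", "\U0001f600"]
by_code_point = sorted(keys)
by_utf16 = sorted(keys, key=lambda k: k.encode("utf-16-be"))
print(by_code_point == by_utf16)  # False
```

The first case matters after a round trip: json.loads always returns string keys, so a dict built with int keys and its decoded copy can serialize in different orders.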
This is a reference implementation. The PBFT request dicts are constructed internally with simple string keys and string/integer values — no floats, no custom objects, no Unicode surprises. The default=str is a safety net that probably never fires in normal operation. In this controlled context, json.dumps(sort_keys=True) produces consistent output.
Every other json.dumps call in the codebase is for storage, not hashing:
| File | Line | Purpose |
|------|------|---------|
| event-sourcing-store/event_store.py | 132 | Append to WAL |
| batch-word-count/pipeline.py | 134, 147, 242, 320 | Write intermediate records |
| partitioned-log/partitioned_log.py | 427 | Write to partition files |
| unbundled-database/unbundled_database.py | 50 | Persist log entries |
None of these use sort_keys=True because they don't need to — they're writing data to be read back by json.loads, not compared byte-for-byte. Serialization roundtripping doesn't require canonical form.
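A quick illustration of why storage paths can ignore canonical form, assuming a hypothetical log record: the two serializations differ as text yet decode to equal objects.

```python
import json

# Hypothetical record, similar in shape to a log or WAL entry.
record = {"offset": 7, "key": "user-1", "value": "login"}

sorted_form = json.dumps(record, sort_keys=True)
insertion_form = json.dumps(record)

print(sorted_form == insertion_form)                          # False
print(json.loads(sorted_form) == json.loads(insertion_form))  # True
```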
The Merkle tree (merkle-tree/merkle_tree.py:12) avoids the problem entirely by hashing raw bytes directly:
```python
def _sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()
```
Callers provide the bytes — no serialization ambiguity. This is the correct approach when hash stability is critical.
If someone extended the PBFT implementation to handle requests with:
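- datetime or custom types (hitting default=str)

Any of these would risk hash divergence across replicas. A production system would use a proper canonical serialization format (e.g., RFC 8785 JSON Canonicalization Scheme or deterministically encoded CBOR). Protobuf is a weaker fit: its deterministic serialization mode is not guaranteed to be canonical across languages or library versions.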
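Short of adopting one of those formats, a stopgap is to narrow the input space until json.dumps has no choices left to make. The sketch below is an assumption about how that could look, not code from the repository: `strict_digest` and `_check` are hypothetical names, and the result is not RFC 8785 compliant (keys are still ordered by code point, and Unicode normalization is not addressed).

```python
import hashlib
import json


def _check(value, path="request"):
    """Reject anything outside str / int / list / dict-with-str-keys."""
    # bool is a subclass of int, so it has to be excluded explicitly.
    if isinstance(value, bool) or not isinstance(value, (str, int, list, dict)):
        raise TypeError(f"{path}: unsupported type {type(value).__name__}")
    if isinstance(value, dict):
        for key, item in value.items():
            if not isinstance(key, str):
                raise TypeError(f"{path}: non-string key {key!r}")
            _check(item, f"{path}.{key}")
    elif isinstance(value, list):
        for index, item in enumerate(value):
            _check(item, f"{path}[{index}]")


def strict_digest(request):
    """Hash a request after refusing every type that serializes ambiguously."""
    _check(request)
    payload = json.dumps(
        request,
        sort_keys=True,
        separators=(",", ":"),  # no optional whitespace
        ensure_ascii=True,      # pinned explicitly rather than left to the default
        allow_nan=False,
    )  # note: no default=str, so unknown types can never slip through
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Key order no longer matters, and floats are refused rather than hashed.
print(strict_digest({"op": "set", "seq": 1}) == strict_digest({"seq": 1, "op": "set"}))  # True
try:
    strict_digest({"amount": 0.3})
except TypeError as exc:
    print(exc)  # request.amount: unsupported type float
```

Failing loudly on floats, datetimes, and non-string keys keeps the digest safe for the simple string and integer payloads described above, at the cost of rejecting richer requests until a real canonical format is in place.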