Date: 2026-05-29
Time: 08:52
avro-serializer/avro_serializer.pyThis file is a from-scratch implementation of Apache Avro's binary serialization format, built to demonstrate the schema evolution concepts from DDIA Chapter 4 (Encoding and Evolution). It owns the entire encode/decode pipeline: schema parsing, binary serialization, schema resolution (reading data written with one schema using a different schema), compatibility checking, and a simple schema registry. This is not a wrapper around the avro library — it reimplements the wire format and resolution rules directly.
PRIMITIVES — the eight Avro primitive type names. Used as the ground truth for type validation throughout.NODEFAULT — sentinel object for distinguishing "no default provided" from None as a default value. Used in canonical form computation.PROMOTIONS — set of (from, to) pairs defining which type widening is legal during schema resolution: int→long→float→double, plus int→float, int→double, long→double. This is the Avro spec's promotion rule.zigzagencode, zigzagdecode, writevarint, readvarint, writelong, readlong)Avro encodes integers using zigzag encoding on top of variable-length encoding. Zigzag maps signed integers to unsigned (0→0, -1→1, 1→2, -2→3, ...) so small-magnitude negatives use few bytes. writelong/readlong compose zigzag with varint and are the workhorses — they encode every integer in the format: lengths, array counts, union indices, enum ordinals, and the int/long types themselves.
SchemaParses an Avro-like schema definition (JSON-compatible Python dicts/lists/strings) into an internal representation. Validates structure eagerly in _parse:
"type" — supports record, array, map, enum, fixed, and bare primitives.The class exposes typed properties (fields, items, values, symbols, uniontypes, size) that return empty/None for inapplicable types. canonical_form() produces a hashable tuple representation for equality/hashing — two Schema objects with structurally identical definitions are equal regardless of construction path.
AvroEncoderSerializes Python values to bytes given a writer schema. The encode method dispatches on schema.typename:
null writes nothing, boolean writes one byte, int/long use zigzag+varint, float/double use little-endian IEEE 754.string/bytes: length-prefixed with a zigzag long.record: fields encoded in declaration order — no field names in the output.array/map: block-encoded with a count prefix, terminated by a 0-count block.union: writes the branch index as a long, then the value.enum: writes the symbol's ordinal index.fixed: writes raw bytes (must match declared size exactly).matchunion resolves which union branch to use via Python type inspection. It checks bool before int (since bool is an int subclass in Python), and for ambiguous dict values, prefers record if the keys match field names.
AvroDecoderReads binary data using a writer schema (what the data was written with) and resolves it against a reader schema (what the consumer expects). This is the core of schema evolution. The _decode method handles:
int→long, int→float, etc., via _promote.resolverecord): reads all writer fields in wire order, matches them to reader fields by name, skips unknown writer fields, and fills missing reader fields from defaults. This is where forward/backward compatibility happens._skip: reads and discards a value without allocating — used when the writer has a field the reader doesn't care about.checkcompatibility / resolve_checkPerforms a dry-run of schema resolution without any data, checking whether a reader can consume writer-produced data (backward compatibility) and vice versa (forward compatibility). Returns a dict with backwardcompatible, forwardcompatible, full_compatible, and a list of errors.
SchemaRegistryA minimal in-memory registry that assigns auto-incrementing integer IDs to schemas. encodewithid prepends a 4-byte big-endian schema ID to the encoded payload; decodewithid strips it and looks up the writer schema. This models Confluent's Schema Registry pattern where the writer schema ID travels with the message.
Schema.parse, AvroEncoder.encode, and AvroDecoder._decode use recursive dispatch on the type discriminator. The schema tree and the encode/decode logic mirror each other structurally.NODEFAULT avoids conflating "default is None" with "no default specified" — important because null is a valid Avro default.Imports: Only stdlib — io.BytesIO for buffer management, struct for IEEE 754 float encoding and the 4-byte schema ID header. No external dependencies.
Imported by: testavro.py and testavro_serializer.py — two test suites that exercise the serializer and schema evolution scenarios.
Encode path: caller creates Schema(definition) → creates AvroEncoder(schema) → calls encoder.encode(value) → recursive _encode walks the schema tree and writes bytes to a BytesIO buffer → returns bytes.
Decode path: caller creates AvroDecoder(writerschema, readerschema) → calls decoder.decode(data) → recursive _decode reads from a BytesIO buffer using the writer schema to know the wire layout, but returns values shaped by the reader schema.
Registry path: registry.encodewithid(sid, value) → encodes normally → prepends 4-byte ID. registry.decodewithid(data, reader_schema) → strips ID → looks up writer schema → decodes with resolution.
1. Field order is wire order: Record fields are encoded/decoded in the order declared in the writer schema. Reordering fields in the schema changes the wire format.
2. No self-describing format: The binary output contains no type tags, field names, or schema metadata. You cannot decode without the writer schema.
3. Names must match for named types: Records, enums, and fixed types require matching name fields during resolution — you can't rename a type and maintain compatibility.
4. Promotion is one-directional: int→long works, long→int does not. The PROMOTIONS set defines all legal promotions.
5. Missing reader fields require defaults: During record resolution, if the reader has a field not present in the writer, it must have a default or resolution fails with SchemaCompatibilityError.
6. Union branch uniqueness: A union cannot contain two branches with the same type name (or record name), enforced at parse time.
7. Schema IDs are 4-byte big-endian unsigned: The registry prepends struct.pack('>I', schema_id) — schema IDs are limited to [0, 2^32).
Two custom exceptions partition the error space:
SchemaError — raised during Schema._init_ for structurally invalid definitions (unknown types, missing names, duplicate fields). This is a parse-time error; if you have a Schema object, it's valid.SchemaCompatibilityError — raised during decode resolution or compatibility checking when schemas can't be reconciled (name mismatches, missing fields without defaults, incompatible types).ValueError — raised during encoding for data-level issues (value out of range for int, wrong byte count for fixed, no matching union branch) and during decoding for truncated input.KeyError — raised by SchemaRegistry.get for unknown schema IDs.Errors are never swallowed. The compatibility checker catches SchemaCompatibilityError internally to report rather than raise, but the check functions always return a result.
avro-serializer/testavroserializer.py — See what schema evolution scenarios are tested: field addition/removal, type promotion, enum symbol changesavro-block-encoding — How Avro's block structure for arrays/maps enables skipping without parsing elements, and why the encoder never uses negative countsavro-serializer/avroserializer.py:resolve_record — The heart of schema evolution — trace through a scenario where writer has fields A,B,C and reader has B,C,D with defaultconfluent-schema-registry-protocol — Compare SchemaRegistry.encodewithid to Confluent's wire format (magic byte + 4-byte ID) to see what's simplified hereavro-serializer/test_avro.py — Likely tests the lower-level encoding/decoding without schema evolution, good for understanding the wire formatavro-no-self-describing — The binary encoding contains no type information; decoding requires the exact writer schema that produced the bytesavro-record-field-order-is-wire-order — Record fields are serialized in schema declaration order with no field identifiers, so reordering fields in the writer schema is a breaking changeavro-promotion-asymmetric — Type promotion during resolution is one-directional (int→long, int→float, int→double, long→float, long→double, float→double); reverse promotion is a SchemaCompatibilityErroravro-schema-equality-structural — Two Schema instances are equal if their canonical forms match, regardless of how they were constructed or whether they carry different defaultsavro-registry-id-prefix — SchemaRegistry prepends a 4-byte big-endian unsigned int schema ID to encoded payloads, with no magic byte (unlike Confluent's 5-byte header)