File: avro-serializer/avro_serializer.py

Date: 2026-05-29

Time: 08:52

avro-serializer/avro_serializer.py

Purpose

This file is a from-scratch implementation of Apache Avro's binary serialization format, built to demonstrate the schema evolution concepts from DDIA Chapter 4 (Encoding and Evolution). It owns the entire encode/decode pipeline: schema parsing, binary serialization, schema resolution (reading data written with one schema using a different schema), compatibility checking, and a simple schema registry. This is not a wrapper around the avro library — it reimplements the wire format and resolution rules directly.

Key Components

Constants and Configuration

Zigzag / Varint Encoding (zigzagencode, zigzagdecode, writevarint, readvarint, writelong, readlong)

Avro encodes integers using zigzag encoding on top of variable-length encoding. Zigzag maps signed integers to unsigned (0→0, -1→1, 1→2, -2→3, ...) so small-magnitude negatives use few bytes. writelong/readlong compose zigzag with varint and are the workhorses — they encode every integer in the format: lengths, array counts, union indices, enum ordinals, and the int/long types themselves.

Schema

Parses an Avro-like schema definition (JSON-compatible Python dicts/lists/strings) into an internal representation. Validates structure eagerly in _parse:

The class exposes typed properties (fields, items, values, symbols, uniontypes, size) that return empty/None for inapplicable types. canonical_form() produces a hashable tuple representation for equality/hashing — two Schema objects with structurally identical definitions are equal regardless of construction path.

AvroEncoder

Serializes Python values to bytes given a writer schema. The encode method dispatches on schema.typename:

matchunion resolves which union branch to use via Python type inspection. It checks bool before int (since bool is an int subclass in Python), and for ambiguous dict values, prefers record if the keys match field names.

AvroDecoder

Reads binary data using a writer schema (what the data was written with) and resolves it against a reader schema (what the consumer expects). This is the core of schema evolution. The _decode method handles:

checkcompatibility / resolve_check

Performs a dry-run of schema resolution without any data, checking whether a reader can consume writer-produced data (backward compatibility) and vice versa (forward compatibility). Returns a dict with backwardcompatible, forwardcompatible, full_compatible, and a list of errors.

SchemaRegistry

A minimal in-memory registry that assigns auto-incrementing integer IDs to schemas. encodewithid prepends a 4-byte big-endian schema ID to the encoded payload; decodewithid strips it and looks up the writer schema. This models Confluent's Schema Registry pattern where the writer schema ID travels with the message.

Patterns

Dependencies

Imports: Only stdlib — io.BytesIO for buffer management, struct for IEEE 754 float encoding and the 4-byte schema ID header. No external dependencies.

Imported by: testavro.py and testavro_serializer.py — two test suites that exercise the serializer and schema evolution scenarios.

Flow

Encode path: caller creates Schema(definition) → creates AvroEncoder(schema) → calls encoder.encode(value) → recursive _encode walks the schema tree and writes bytes to a BytesIO buffer → returns bytes.

Decode path: caller creates AvroDecoder(writerschema, readerschema) → calls decoder.decode(data) → recursive _decode reads from a BytesIO buffer using the writer schema to know the wire layout, but returns values shaped by the reader schema.

Registry path: registry.encodewithid(sid, value) → encodes normally → prepends 4-byte ID. registry.decodewithid(data, reader_schema) → strips ID → looks up writer schema → decodes with resolution.

Invariants

1. Field order is wire order: Record fields are encoded/decoded in the order declared in the writer schema. Reordering fields in the schema changes the wire format.

2. No self-describing format: The binary output contains no type tags, field names, or schema metadata. You cannot decode without the writer schema.

3. Names must match for named types: Records, enums, and fixed types require matching name fields during resolution — you can't rename a type and maintain compatibility.

4. Promotion is one-directional: int→long works, long→int does not. The PROMOTIONS set defines all legal promotions.

5. Missing reader fields require defaults: During record resolution, if the reader has a field not present in the writer, it must have a default or resolution fails with SchemaCompatibilityError.

6. Union branch uniqueness: A union cannot contain two branches with the same type name (or record name), enforced at parse time.

7. Schema IDs are 4-byte big-endian unsigned: The registry prepends struct.pack('>I', schema_id) — schema IDs are limited to [0, 2^32).

Error Handling

Two custom exceptions partition the error space:

Errors are never swallowed. The compatibility checker catches SchemaCompatibilityError internally to report rather than raise, but the check functions always return a result.

Topics to Explore

Beliefs