File: unbundled-database/unbundled_database.py

Date: 2026-05-29

Time: 11:24

Purpose

This file implements the "database inside-out" pattern from DDIA Chapter 12 — the idea that a modern data system is really a composition of independent subsystems (log, storage, indexes, views) glued together by a change stream, rather than a monolithic engine. Each component is explicit and replaceable: a write-ahead log captures intent, a storage engine materializes state, a CDC stream propagates changes, and derived systems (secondary indexes, materialized views, full-text search) consume those changes independently.

It serves as a teaching implementation: all the pieces you'd find spread across Kafka, PostgreSQL, and Elasticsearch are composed in ~350 lines of Python so you can see the wiring.

Key Components

Data Classes

Core Infrastructure

Derived Systems (all extend DerivedSystem)

Facade

Patterns

Write-ahead logging: Every mutation hits the WAL before the storage engine. The WAL is the source of truth; storage is a materialized view of the log.
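
The "log first, then materialize" ordering can be illustrated with a minimal sketch. It is not the file's actual code: the LogEntry name, its field names, and the in-memory-only WAL are assumptions made for illustration. Only WAL.append(op, key, value) and StorageEngine.apply(entry) follow the calls named in this summary.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class LogEntry:  # hypothetical name and fields
    lsn: int          # log sequence number, assigned by the WAL
    op: str           # "PUT" or "DELETE"
    key: str
    value: Any = None


class WAL:
    """Append-only in-memory log (disk persistence is omitted in this sketch)."""

    def __init__(self) -> None:
        self.entries: List[LogEntry] = []

    def append(self, op: str, key: str, value: Any = None) -> LogEntry:
        entry = LogEntry(lsn=len(self.entries) + 1, op=op, key=key, value=value)
        self.entries.append(entry)
        return entry


class StorageEngine:
    """Key-value state derived purely from log entries."""

    def __init__(self) -> None:
        self.data: Dict[str, Any] = {}
        self.lsn = 0  # LSN of the last applied entry

    def apply(self, entry: LogEntry) -> None:
        if entry.op == "PUT":
            self.data[entry.key] = entry.value
        elif entry.op == "DELETE":
            self.data.pop(entry.key, None)
        self.lsn = entry.lsn


wal = WAL()
storage = StorageEngine()
for op, key, value in [("PUT", "a", 1), ("PUT", "b", 2), ("DELETE", "a", None)]:
    storage.apply(wal.append(op, key, value))  # log first, then materialize

# Replaying the log into a fresh engine reproduces the same state.
rebuilt = StorageEngine()
for entry in wal.entries:
    rebuilt.apply(entry)
```

Because the rebuilt engine ends up identical to the live one, storage here really is a view that can be discarded and recomputed from the log.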

Event sourcing / CDC: The CDCStream converts low-level log entries into semantic change events. This is the "derived data" pattern from DDIA — the primary store and all secondary structures are different representations of the same underlying log.
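
As a rough sketch of that conversion, the snippet below turns a log entry plus the previous value into a semantic event. The field names, the LogEntry type, and the handling of "DELETE" are assumptions; the summary only states that insert and update are distinguished by whether an old value existed.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass(frozen=True)
class LogEntry:  # hypothetical shape of a WAL entry
    lsn: int
    op: str
    key: str
    value: Any = None


@dataclass(frozen=True)
class CDCEvent:  # field names are assumptions
    lsn: int
    kind: str                 # "insert", "update", or "delete"
    key: str
    old_value: Optional[Any]
    new_value: Optional[Any]


class CDCStream:
    def __init__(self) -> None:
        self.events: List[CDCEvent] = []

    def emit(self, entry: LogEntry, old_value: Optional[Any]) -> CDCEvent:
        # Note: None doubles as "no previous value", so a stored None
        # would be indistinguishable from a missing key in this sketch.
        if entry.op == "DELETE":
            kind, new_value = "delete", None
        elif old_value is None:
            kind, new_value = "insert", entry.value
        else:
            kind, new_value = "update", entry.value
        event = CDCEvent(entry.lsn, kind, entry.key, old_value, new_value)
        self.events.append(event)
        return event


stream = CDCStream()
first = stream.emit(LogEntry(1, "PUT", "user:1", "Ada"), old_value=None)
second = stream.emit(LogEntry(2, "PUT", "user:1", "Grace"), old_value="Ada")
third = stream.emit(LogEntry(3, "DELETE", "user:1"), old_value="Grace")
```

The same "PUT" operation yields an insert the first time and an update the second time, which is the extra meaning the change stream adds on top of the raw log.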

Consumer position tracking: Each DerivedSystem independently tracks its position (last processed LSN). process_pending() iterates the full event list and skips events the consumer has already seen. This is analogous to Kafka consumer group offsets.
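
The skip-what-you-have-seen loop might look roughly like the following. This is a sketch under assumptions: events are simplified to (lsn, key, value) tuples, CountingConsumer is a hypothetical consumer, and process_pending is written as a free function rather than a CDCStream method.

```python
from typing import List, Tuple

Event = Tuple[int, str, str]  # (lsn, key, value): simplified stand-in for CDCEvent


class CountingConsumer:
    """Hypothetical consumer that remembers the last LSN it processed."""

    def __init__(self) -> None:
        self.position = 0
        self.seen: List[str] = []

    def process_event(self, event: Event) -> None:
        lsn, key, _value = event
        self.seen.append(key)
        self.position = lsn


def process_pending(events: List[Event], consumers: List[CountingConsumer]) -> int:
    """Deliver every event a consumer has not seen yet; return the delivery count."""
    delivered = 0
    for consumer in consumers:
        for event in events:          # full scan of the event list
            if event[0] <= consumer.position:
                continue              # already processed by this consumer
            consumer.process_event(event)
            delivered += 1
    return delivered


events: List[Event] = [(1, "a", "x"), (2, "b", "y")]
early = CountingConsumer()
first_batch = process_pending(events, [early])

events.append((3, "c", "z"))
late = CountingConsumer()             # starts at position 0
second_batch = process_pending(events, [early, late])
```

Each consumer advances independently: in the second call, early receives only the new event while late receives all three.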

Snapshot-then-stream: CDCStream.snapshot_and_stream() synthesizes insert events from the current storage state, then subscribes the consumer for future events. This solves the "how does a new consumer catch up" problem without replaying the full log.
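
A minimal sketch of the catch-up step follows. The method signature (passing the storage dict and its LSN explicitly), the tuple-shaped events, and the ViewConsumer class are assumptions for illustration; the real method may obtain the storage state differently.

```python
from typing import Any, Dict, List, Tuple

Event = Tuple[int, str, str, Any]  # (lsn, kind, key, value): simplified event


class ViewConsumer:
    """Hypothetical consumer that mirrors the key-value state."""

    def __init__(self) -> None:
        self.position = 0
        self.view: Dict[str, Any] = {}

    def process_event(self, event: Event) -> None:
        lsn, kind, key, value = event
        if kind == "delete":
            self.view.pop(key, None)
        else:
            self.view[key] = value
        self.position = lsn


class CDCStream:
    def __init__(self) -> None:
        self.events: List[Event] = []
        self.consumers: List[ViewConsumer] = []

    def snapshot_and_stream(self, consumer: ViewConsumer,
                            storage: Dict[str, Any], storage_lsn: int) -> None:
        # 1. Synthesize one insert per key currently in storage.
        for key, value in storage.items():
            consumer.process_event((storage_lsn, "insert", key, value))
        # 2. Pin the position to the snapshot LSN (also covers empty storage).
        consumer.position = storage_lsn
        # 3. Subscribe for everything that happens afterwards.
        self.consumers.append(consumer)

    def process_pending(self) -> None:
        for consumer in self.consumers:
            for event in self.events:
                if event[0] > consumer.position:
                    consumer.process_event(event)


stream = CDCStream()
storage = {"a": 1, "b": 2}
stream.events = [(1, "insert", "a", 1), (2, "insert", "b", 2)]  # already applied

late = ViewConsumer()
stream.snapshot_and_stream(late, storage, storage_lsn=2)
snapshot_view = dict(late.view)

stream.events.append((3, "update", "a", 10))
stream.process_pending()
```

The late consumer never replays events 1 and 2; it starts from the snapshot and then picks up only event 3.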

Template method via ABC: DerivedSystem defines the interface contract; concrete implementations provide process_event(), rebuild(), and get_state().
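
The contract could be sketched as below, with a secondary index as one concrete implementation. The method signatures, the dict-shaped events, and the SecondaryIndex internals are assumptions; only the base class name and the three method names come from this summary.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional, Set


class DerivedSystem(ABC):
    def __init__(self) -> None:
        self.position = 0  # last processed LSN

    @abstractmethod
    def process_event(self, event: Dict[str, Any]) -> None: ...

    @abstractmethod
    def rebuild(self, snapshot: Dict[str, Any]) -> None: ...

    @abstractmethod
    def get_state(self) -> Any: ...


class SecondaryIndex(DerivedSystem):
    """Maps the value of one field to the set of keys that carry it."""

    def __init__(self, field: str) -> None:
        super().__init__()
        self.field = field
        self.index: Dict[Any, Set[str]] = {}

    def _add(self, key: str, row: Optional[Dict[str, Any]]) -> None:
        if row is not None and self.field in row:
            self.index.setdefault(row[self.field], set()).add(key)

    def _remove(self, key: str, row: Optional[Dict[str, Any]]) -> None:
        if row is not None and self.field in row:
            keys = self.index.get(row[self.field])
            if keys is not None:
                keys.discard(key)
                if not keys:
                    del self.index[row[self.field]]

    def process_event(self, event: Dict[str, Any]) -> None:
        # Un-index the old row, then index the new one.
        self._remove(event["key"], event.get("old_value"))
        self._add(event["key"], event.get("new_value"))
        self.position = event["lsn"]

    def rebuild(self, snapshot: Dict[str, Any]) -> None:
        self.index.clear()
        for key, row in snapshot.items():
            self._add(key, row)

    def get_state(self) -> Dict[Any, list]:
        return {value: sorted(keys) for value, keys in self.index.items()}


idx = SecondaryIndex("city")
idx.process_event({"lsn": 1, "key": "u1", "old_value": None,
                   "new_value": {"city": "Oslo"}})
idx.process_event({"lsn": 2, "key": "u1", "old_value": {"city": "Oslo"},
                   "new_value": {"city": "Rome"}})
state = idx.get_state()
```

Because the three methods are abstract, instantiating DerivedSystem directly raises TypeError, so an incomplete derived system fails at construction rather than at the first event.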

Dependencies

Imports: Only stdlib — json, os, abc, dataclasses, typing. No external dependencies.

Imported by: The three test files (test_unbundled_database.py, tester_test_unbundled_database.py, test_tester_validation.py) exercise the system end-to-end.

Flow

A typical write flows:

1. UnbundledDatabase.put(key, value) captures the old value from storage

2. WAL.append("PUT", key, value) assigns an LSN and (optionally) persists to disk

3. StorageEngine.apply(entry) updates the in-memory store and advances its LSN

4. CDCStream.emit(entry, old_value) creates a CDCEvent (insert vs. update based on whether old_value existed) and appends it to the event log

5. The CDCEvent is returned to the caller — but derived systems haven't seen it yet

6. UnbundledDatabase.flush() → CDCStream.process_pending() pushes pending events to all subscribed consumers based on each consumer's position

This lazy propagation model means writes are fast (only WAL + storage), and derived system updates are batched. The caller controls when propagation happens by calling flush().
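
The six steps can be compressed into a single-class sketch to show where the laziness comes from. This is a simplification under assumptions: the WAL, storage, and event log are plain lists and dicts inside the facade, events are tuples, and subscribe() and KeyCounter are hypothetical names that do not appear in this summary.

```python
from typing import Any, Dict, List, Tuple

Entry = Tuple[int, str, str, Any]        # (lsn, op, key, value)
Event = Tuple[int, str, str, Any, Any]   # (lsn, kind, key, old_value, new_value)


class KeyCounter:
    """Hypothetical derived system that counts distinct inserted keys."""

    def __init__(self) -> None:
        self.position = 0
        self.count = 0

    def process_event(self, event: Event) -> None:
        lsn, kind, *_ = event
        if kind == "insert":
            self.count += 1
        self.position = lsn


class UnbundledDatabase:
    def __init__(self) -> None:
        self.wal: List[Entry] = []
        self.storage: Dict[str, Any] = {}
        self.events: List[Event] = []
        self.consumers: List[KeyCounter] = []

    def subscribe(self, consumer: KeyCounter) -> None:  # hypothetical helper
        self.consumers.append(consumer)

    def put(self, key: str, value: Any) -> Event:
        old_value = self.storage.get(key)                 # step 1: capture old value
        entry = (len(self.wal) + 1, "PUT", key, value)    # step 2: WAL assigns the LSN
        self.wal.append(entry)
        self.storage[key] = value                         # step 3: apply to storage
        kind = "insert" if old_value is None else "update"
        event = (entry[0], kind, key, old_value, value)   # step 4: emit the CDC event
        self.events.append(event)
        return event                                      # step 5: consumers not notified

    def flush(self) -> int:
        delivered = 0                                     # step 6: propagate on demand
        for consumer in self.consumers:
            for event in self.events:
                if event[0] > consumer.position:
                    consumer.process_event(event)
                    delivered += 1
        return delivered


db = UnbundledDatabase()
counter = KeyCounter()
db.subscribe(counter)

db.put("a", 1)
db.put("b", 2)
db.put("a", 3)
count_before_flush = counter.count   # derived system is still stale here
delivered = db.flush()
```

Until flush() runs, the counter lags behind storage; afterwards it has processed all three events in one batch. That staleness window is the cost of keeping put() limited to the WAL and storage.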

Invariants

Error Handling

Minimal — this is a teaching implementation. Notable behaviors:

Topics to Explore

Beliefs