Original source: topic-ddia-chapter-10-batch-processing

Topic: The Unix philosophy of composable tools, sort-merge joins, and the distinction between streaming and materializing operators that this code demonstrates

Date: 2026-05-29

Time: 12:53

The Unix Philosophy in Batch and Stream Processing

Composable Tools: The Pipeline as a Chain of Generators

The batch-word-count/pipeline.py file is the clearest expression of the Unix philosophy in this codebase. Each Stage is a standalone transformation — it consumes an iterator and yields records, exactly like a Unix command consuming stdin and writing to stdout. The Pipeline class composes stages by nesting their generators:


Sort._tracked_process(Count._tracked_process(Tokenize._tracked_process(ReadLines._tracked_process(...))))

This is cat | tokenize | sort | uniq -c rendered as Python generators. Data flows record-by-record through the chain, pulled lazily from the tail — the Pipeline never builds a central data structure. The three execution modes (run(), runlazy(), runto_file()) in pipeline.py mirror the Unix choice between materializing output, piping it, or redirecting to a file.

The Pipeline.add_stage() method returns self for fluent builder-style chaining. But critically, there is no schema validation between stages — just as Unix pipes pass untyped byte streams, the pipeline passes untyped tuples. A Tokenize stage after a Sort would break at runtime if the tuple shape doesn't match, but nothing prevents you from building that pipeline.

The mapreduce-framework/mapreduce.py takes this further with MapReducePipeline, which chains multiple MapReduceJob stages where the output of stage N becomes the input of stage N+1 — the same composition pattern at a coarser granularity.

Sort-Merge Joins: Three Strategies, Same Result

map-side-join/mapsidejoins.py implements the three map-side join strategies from DDIA Chapter 10, each trading different assumptions for performance:

| Strategy | Assumption | Mechanism |

|----------|-----------|-----------|

| BroadcastHashJoin | One dataset fits in memory | Build hash table from small side, probe with large side |

| PartitionedHashJoin | Both sides co-partitioned on join key | Hash-partition both, local hash join per partition |

| SortMergeJoin | Both sides sorted by join key | Linear two-pointer merge |

The SortMergeJoin is the textbook sort-merge: it checks sort order with issorted(), sorts if needed, then walks two pointers forward. When duplicate keys appear, it collects all matching records from both sides and emits their cartesian product — every left record paired with every right record sharing that key.

The comparejoinstrategies() function is the pedagogical payoff: it runs all three strategies on the same data and verifies they produce identical results (after normalizing away mapperid tags). This demonstrates DDIA's point that these are different implementations of the same logical operation, each optimal under different data assumptions.

All three share the same field-conflict resolution via mergerecords(): the join key appears exactly once in the output, and any other shared field names get left/right prefixes.

Streaming vs. Materializing Operators

This is the deepest concept in the codebase, and it cuts across multiple modules. The key question is: does an operator need to see all its input before producing any output?

The Spectrum in `pipeline.py`

The batch pipeline's stages fall into three categories:

| Category | Stages | Memory Behavior |

|----------|--------|-----------------|

| Streaming | ReadLines, Tokenize, Filter, FlatMap, Partition | O(1) memory — yield immediately per input record |

| Partial materialization | TopN | O(N) memory — bounded heap, yields only at end |

| Full materialization | Count, Sort | O(input) memory — must see everything before yielding anything |

Count is explicitly a pipeline barrier: it accumulates all keys in a defaultdict before yielding any output, which forces every upstream stage to run to completion. Sort is the same — you cannot emit sorted output until you've seen the last record. These are the operators that break the Unix illusion of smooth record-by-record flow.

Sort in pipeline.py mitigates this with external merge sort: when the in-memory buffer exceeds memorylimit, it spills sorted chunks to disk as JSONL, then uses heapq.merge for k-way merge across chunks. The KeyedRecord wrapper provides stable ordering via a seq tiebreaker. This is the same technique MapReduce uses for its shuffle phase — and it's why the test in testpipeline.py verifies that external sort produces identical results to in-memory sort.

Batch vs. Stream: Two Worlds of Materialization

The mapreduce-framework/mapreduce.py represents full materialization between phases. Intermediate data is serialized as JSON files (map-{mapper_id}-part-{partition}.json) and written to disk. The shuffle phase reads all intermediate files for a partition, sorts by key, groups with itertools.groupby, then feeds to the reducer. Every phase boundary is a materialization barrier — this is the fundamental limitation of MapReduce that dataflow engines like Spark address.

In contrast, stream-join-processor/streamjoinprocessor.py represents continuous, incremental processing. The StreamJoinProcessor never waits for "all input" because in a stream there is no "all." Instead it uses:

Event-time watermarks — the highest timestamp seen — to decide when events are "expired"
Probe-on-arrival — when an event arrives, it's buffered and immediately probed against the opposite buffer (a nested-loop join scoped to matching keys)
Deferred miss emission — outer-join misses are emitted only at expiration (watermark - window_duration), not eagerly, because a match might still arrive

The stream processor's buffers (defaultdict(list) keyed by join key) are a form of materialization, but bounded — events expire and are removed. This is the fundamental tradeoff: batch operators materialize everything for correctness; stream operators bound materialization for latency, accepting that late events may be dropped.

The Unbundled Database: Composition Through Logs

The unbundled-database/unbundled_database.py synthesizes both paradigms. The WriteAheadLog is a streaming, append-only data source. The StorageEngine is a materializing consumer — it maintains the full current state. The CDCStream fans changes out to derived systems (SecondaryIndex, MaterializedView, FullTextSearch), each of which tracks its own position independently (analogous to Kafka consumer group offsets).

This is the Unix philosophy applied at the system level: each component has one job, communicates through a shared log (the "pipe"), and can be replaced independently. The DerivedSystem interface is the composable stage contract — implement processevent(), rebuild(), and getstate(), and you plug into the pipeline.

The snapshotandstream() method on CDCStream solves the "new consumer catch-up" problem by synthesizing insert events from current storage state, then subscribing for future events — a neat bridge between the materialized world (full state) and the streaming world (incremental changes).