Date: 2026-05-29
Time: 13:28
I can't read the implementation file due to permissions, but the test file itself is very revealing. Here's the explanation based on what's visible.
map-side-join/mapsidejoins.py - The implementation of all three join strategies, including hash table construction, partition assignment, merge logic, and the JoinResult dataclass
map-side-join/mapsidejoins.py:BroadcastHashJoin - How the small dataset is replicated to each mapper and how the hash table is built and probed during the join
map-side-join/mapsidejoins.py:SortMergeJoin - The sorted-input detection logic and the merge algorithm, especially how it handles duplicates for many-to-many joins
map-side-join-vs-reduce-side-join - DDIA Ch. 10 contrasts map-side and reduce-side joins; understanding when each is appropriate (pre-partitioned/sorted data vs. arbitrary input)
map-side-join/testertestmapsidejoins.py - The meta-test wrapper that likely validates the test suite itself against expected behaviors

broadcast-join-loads-small-side-at-construction - BroadcastHashJoin receives the small dataset at construction time and builds a hash table; the large dataset is streamed through .join(), reflecting the real-world pattern where the small side is broadcast to all mapper nodes
all-three-join-strategies-produce-identical-inner-join-results - For the same input data, BroadcastHashJoin, PartitionedHashJoin, and SortMergeJoin all produce the same set of output records on inner joins, verified at both small (4×5) and large (1000×2000) scale
missing-join-keys-are-skipped-not-errored - Records lacking the join key field are silently excluded from results and counted in stats["skipped_records"] rather than raising an exception
field-name-conflicts-resolved-with-left-right-prefix - When both sides of a join contain the same field name, the output record prefixes the conflicting fields with left and right to avoid data loss
sort-merge-join-detects-presorted-input - SortMergeJoin inspects whether input is already sorted and reports this via stats["sorted_input"], which allows it to skip the sort step on pre-sorted data
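
Since I can't see the implementation, here is a minimal Python sketch of the behavior the tests point to: a hash table built from the small side at construction, the large side streamed through .join(), records without the join key counted in stats["skipped_records"], and a merge join that checks whether its input is already sorted. Everything beyond those observed behaviors is an assumption on my part, not the project's code: the constructor signature, the function name sort_merge_join, the helper merge_records, the exact "left_"/"right_" prefix spelling, and the choice to treat the streamed side as the left side are all hypothetical.

```python
from collections import defaultdict


def merge_records(left, right, key):
    """Combine two matching records; prefix field names present on both sides.

    The "left_"/"right_" spelling is an assumption made for this sketch.
    """
    conflicts = (left.keys() & right.keys()) - {key}
    merged = {key: left[key]}
    for name, value in left.items():
        if name != key:
            merged[f"left_{name}" if name in conflicts else name] = value
    for name, value in right.items():
        if name != key:
            merged[f"right_{name}" if name in conflicts else name] = value
    return merged


class BroadcastHashJoin:
    """Hypothetical sketch: hash the small side once, then stream the large side."""

    def __init__(self, small, key):
        self.key = key
        self.stats = {"skipped_records": 0}
        self.table = defaultdict(list)  # join key -> small-side records
        for rec in small:
            if key not in rec:
                self.stats["skipped_records"] += 1  # skip, do not raise
                continue
            self.table[rec[key]].append(rec)

    def join(self, large):
        out = []
        for left in large:  # the streamed side is treated as "left" here
            if self.key not in left:
                self.stats["skipped_records"] += 1
                continue
            for right in self.table.get(left[self.key], []):
                out.append(merge_records(left, right, self.key))
        return out


def sort_merge_join(left, right, key):
    """Hypothetical sketch: sort only if needed, then merge runs of equal keys."""
    stats = {"skipped_records": 0, "sorted_input": True}
    sides = []
    for records in (left, right):
        keyed = [r for r in records if key in r]
        stats["skipped_records"] += len(records) - len(keyed)
        keys = [r[key] for r in keyed]
        if any(a > b for a, b in zip(keys, keys[1:])):
            stats["sorted_input"] = False  # at least one side needed sorting
            keyed = sorted(keyed, key=lambda r: r[key])
        sides.append(keyed)
    l, r = sides
    out, i, j = [], 0, 0
    while i < len(l) and j < len(r):
        lk, rk = l[i][key], r[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side, then emit the cross
            # product of the two runs; this is what handles many-to-many joins.
            i_end, j_end = i, j
            while i_end < len(l) and l[i_end][key] == lk:
                i_end += 1
            while j_end < len(r) and r[j_end][key] == lk:
                j_end += 1
            for a in l[i:i_end]:
                for b in r[j:j_end]:
                    out.append(merge_records(a, b, key))
            i, j = i_end, j_end
    return out, stats


def canon(rows):
    """Order-independent form of a result set, for comparing strategies."""
    return sorted(tuple(sorted(row.items())) for row in rows)


users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}, {"name": "no-id"}]
orders = [
    {"id": 2, "name": "book", "qty": 1},
    {"id": 1, "name": "pen", "qty": 3},
    {"id": 2, "name": "lamp", "qty": 2},
    {"qty": 9},
    {"id": 7, "name": "orphan", "qty": 1},
]

bhj = BroadcastHashJoin(users, "id")
broadcast_rows = bhj.join(orders)
merge_rows, merge_stats = sort_merge_join(orders, users, "id")
```

In this sketch the two strategies emit rows in different orders, so comparing them means comparing an order-independent form such as canon(broadcast_rows) and canon(merge_rows). The real test suite presumably does something similar when it checks that all three strategies agree, but I can't confirm how.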