---
schema_version: "1.0"
project_name: "reasons"
updated_at: "2026-06-05T02:22:18+00:00"
node_count: 1405
generator: ftl-reasons/0.43.0
---

# Belief Registry
<!-- Generated by reasons export-markdown. Do not edit — operate through reasons. -->

## Claims

### 2pc-abort-guarantees-no-side-effects [IN] OBSERVATION
When any participant is unavailable, `execute()` returns `"aborted"` and no participant's state is modified, even for participants that were available and could have committed independently
- Source: entries/2026/05/29/two-phase-commit-test_2pc.md

### 2pc-blocking-window [IN] OBSERVATION
The coordinator logs `"committing"` before any participant receives the commit decision, creating a window where a crash leaves participants locked indefinitely in the `"prepared"` state with no authority to self-resolve.
- Source: entries/2026/05/29/topic-2pc-blocking-problem.md

### 2pc-blocking-window-is-between-prepare-and-decision [IN] OBSERVATION
A participant that has voted YES in `prepare()` holds locks and cannot unilaterally commit or abort until it receives the coordinator's decision, creating an unbounded blocking window if the coordinator fails
- Source: entries/2026/05/29/topic-three-phase-commit.md

### 2pc-coordinator-recover-requires-liveness [IN] OBSERVATION
`Coordinator.recover()` can only re-send decisions to participants that are currently available (`is_available()` check), leaving unavailable participants still in-doubt with no resolution path
- Source: entries/2026/05/29/topic-three-phase-commit.md

### 2pc-coordinator-recovery-replays-commit [IN] OBSERVATION
A coordinator with a `"committing"` log entry will re-send commit decisions to all participants during `recover()`, ensuring that a crash after the commit decision but before delivery still completes the transaction
- Source: entries/2026/05/29/two-phase-commit-test_2pc.md

### 2pc-has-design-implementation-gap [IN] OBSERVATION
Two-phase commit has a systematic gap between its safety design and implementation: the coordinator has a known blocking window between logging and sending the decision, timeouts are accepted as a parameter but never enforced, and recovery requires all participants to be available — defeating the protocol's availability goals.

### 2pc-lock-ownership-guard [IN] OBSERVATION
`Participant.abort()` only releases locks where `self.locks.get(op["key"]) == tx_id`, preventing one transaction's abort from releasing another transaction's lock
- Source: entries/2026/05/29/two-phase-commit-two_phase_commit.md

### 2pc-locks-held-during-in-doubt [IN] OBSERVATION
Locks acquired during `prepare()` are only released by `commit()` or `abort()`, meaning an in-doubt transaction blocks all subsequent transactions on the same keys indefinitely until the coordinator recovers
- Source: entries/2026/05/29/topic-three-phase-commit.md

### 2pc-locks-released-on-both-paths [IN] OBSERVATION
Locks held by a transaction are released regardless of whether the transaction commits or aborts, preventing deadlocks across sequential transactions
- Source: entries/2026/05/29/two-phase-commit-test_2pc.md

### 2pc-log-before-decision [IN] OBSERVATION
The coordinator appends `"committing"` or `"aborting"` to its log before broadcasting the decision to participants, enabling crash recovery via `recover()` to re-send decisions that were never delivered
- Source: entries/2026/05/29/two-phase-commit-two_phase_commit.md

### 2pc-participant-recover-returns-in-doubt-only [IN] OBSERVATION
`Participant.recover()` identifies transactions stuck in `"prepared"` state but provides no mechanism to resolve them without the coordinator, making it a diagnostic tool rather than a recovery procedure
- Source: entries/2026/05/29/topic-three-phase-commit.md

### 2pc-recovery-single-pass [IN] OBSERVATION
`Coordinator.recover()` scans the log once and re-sends decisions to currently-available participants; it does not retry or schedule follow-ups, so a participant still down at recovery time stays locked until `recover()` is manually called again.
- Source: entries/2026/05/29/topic-2pc-blocking-problem.md

### 2pc-result-dict-not-exceptions [IN] OBSERVATION
The 2PC protocol communicates all outcomes (commits, aborts, lock conflicts, unavailability) via `{"outcome": "committed"|"aborted", "reason": ...}` return dicts; no method raises exceptions
- Source: entries/2026/05/29/two-phase-commit-test_2pc.md

### 2pc-safety-gaps-compound-under-synchronous-simulation [IN] OBSERVATION
Two-phase commit's design-implementation gaps (known blocking window with no timeout enforcement, recovery requiring participant availability) are validated only under synchronous simulation where messages arrive instantly, meaning the real-world impact of coordinator crashes during the blocking window — where participants hold locks indefinitely — remains untested.

### 2pc-timeout-unused [IN] OBSERVATION
The `timeout` parameter is accepted by `Coordinator.__init__` but never referenced in any logic; availability is modeled via boolean `_available` flags on participants instead of actual time-based timeouts
- Source: entries/2026/05/29/two-phase-commit-two_phase_commit.md

### 2pc-unanimous-vote [IN] OBSERVATION
A transaction commits only if every participant in `participant_operations` votes `"yes"`; a single `"no"` vote triggers abort for all participants regardless of how many voted yes
- Source: entries/2026/05/29/two-phase-commit-two_phase_commit.md

### 2pc-uses-key-level-locking [IN] OBSERVATION
Participants lock at key granularity during `prepare()`; a second transaction touching a locked key receives a `"no"` vote and aborts rather than waiting or queuing
- Source: entries/2026/05/29/two-phase-commit-test_2pc.md

### abort-is-status-change-not-disk-rollback [IN] OBSERVATION
Aborting a transaction in both MVCC and SSI implementations sets a status flag (`_aborted` set or status marker); no disk writes are reversed because uncommitted data never reached disk
- Source: entries/2026/05/29/topic-undo-logging-and-steal-policy.md

### aborted-writes-universally-invisible [IN] OBSERVATION
A version created by an aborted transaction is invisible to every transaction, and a deletion by an aborted transaction is ignored — aborted transactions' effects are completely erased from all snapshots.
- Source: entries/2026/05/29/snapshot-isolation-mvcc_database-_is_visible.md

### add-node-incremental-ring-mutation [IN] OBSERVATION
`add_node` mutates the ring after each vnode insertion, so subsequent vnodes in the same loop see predecessors from earlier iterations, preventing overlapping transfer arcs.
- Source: entries/2026/05/29/topic-transfer-map-accuracy.md

### add-node-mutation-cost-is-quadratic-in-ring-size [IN] OBSERVATION
Each `add_node` call performs `vnode_count` list insertions into a sorted list, each O(total ring entries), making node addition O(V × N×V) in the worst case.
- Source: entries/2026/05/29/topic-virtual-node-count-tuning.md

### add-node-same-owner-guard [IN] OBSERVATION
The `old_owner != node_id` guard at `consistent_hashing.py:38` skips transfer recording when a new vnode's successor is another vnode of the same node, avoiding double-counting already-claimed arcs.
- Source: entries/2026/05/29/topic-transfer-map-accuracy.md

### advance-time-injects-external-watermark [IN] OBSERVATION
`advance_time(timestamp)` lets an external orchestrator push the watermark forward without receiving an actual event, enabling idle-stream handling and explicit trigger semantics; it rejects timestamps at or below the current watermark
- Source: entries/2026/05/29/topic-watermark-vs-processing-time.md

### all-fsync-sites-data-integrity-only [IN] OBSERVATION
None of the 13 `os.fsync()` call sites have callers that depend on mtime or ctime metadata being accurate; all syncs exist purely for data durability, making every site a valid `fdatasync` candidate
- Source: entries/2026/05/29/topic-fdatasync-vs-fsync-tradeoffs.md

### all-three-implementations-use-payload-only-crc [IN] OBSERVATION
The WAL, Bitcask, and B-tree storage engine all compute CRC over data payloads only, excluding their respective header/framing metadata fields — a consistent pattern across the repo
- Source: entries/2026/05/28/topic-wal-crc-coverage-gap.md

### all-to-all-converges-in-one-round [IN] OBSERVATION
`ALL_TO_ALL` topology delivers every pending change to every other node in a single `sync()` call, achieving convergence in one round when no custom-merge cascades occur.
- Source: entries/2026/05/29/topic-ring-vs-all-to-all-propagation-delay.md

### allocate-page-no-initialization [IN] OBSERVATION
`allocate_page` returns a page number containing whatever stale data was previously on disk; the caller must overwrite it with valid page content before the next WAL commit
- Source: entries/2026/05/28/b-tree-storage-engine-btree-allocate_page.md

### allocate-page-prefers-free-list [IN] OBSERVATION
`allocate_page` always recycles from the free list before extending the file with a new page, keeping the data file compact after deletions
- Source: entries/2026/05/28/b-tree-storage-engine-btree-allocate_page.md

### anti-entropy-counts-per-node-writes [IN] OBSERVATION
The repair count returned by `anti_entropy_repair()` reflects individual node writes, not unique keys repaired; one stale key across N nodes counts as N repairs
- Source: entries/2026/05/29/leaderless-replication-dynamo-anti_entropy_repair.md

### anti-entropy-detects-but-cannot-fully-resolve-divergence [IN] OBSERVATION
Anti-entropy can precisely locate divergent key ranges via Merkle tree diffs but cannot fully reconcile them: tombstone semantics differ at every layer (empty-bytes sentinel in LSM, preserved-by-default in merge, replication-convergence-dependent in distributed), preventing consistent cross-replica resolution of deleted keys.

### anti-entropy-does-not-advance-versions [IN] OBSERVATION
`anti_entropy_repair()` propagates existing `VersionedValue` entries without modifying `_version_counters`, so it never creates new versions — only copies the highest existing version to lagging replicas
- Source: entries/2026/05/29/leaderless-replication-dynamo-anti_entropy_repair.md

### anti-entropy-highest-version-wins [IN] OBSERVATION
Anti-entropy repair assumes a total ordering on versions where the highest version number is authoritative, which is safe because `put()` uses a global per-key version counter ensuring no two writes produce the same version
- Source: entries/2026/05/29/leaderless-replication-dynamo-anti_entropy_repair.md

### anti-entropy-layers-are-decoupled [IN] OBSERVATION
The gossip, merkle-tree, vector-clock, read-repair, and hinted-handoff modules are independent implementations with no cross-imports; composing them into a full Dynamo-style anti-entropy pipeline is left to the integrator
- Source: entries/2026/05/29/topic-dynamo-anti-entropy.md

### anti-entropy-repair-skips-unavailable [IN] OBSERVATION
Both `ReadRepairStore.anti_entropy_repair()` and `DynamoCluster.anti_entropy_repair()` only scan and repair currently available replicas; a down replica must rely on a future anti-entropy run after it recovers.
- Source: entries/2026/05/29/topic-anti-entropy-vs-read-repair.md

### anti-entropy-skips-unavailable-nodes [IN] OBSERVATION
Anti-entropy repair only reads from and writes to nodes where `is_available` is true; down nodes are excluded from both key discovery and repair propagation
- Source: entries/2026/05/29/leaderless-replication-dynamo-anti_entropy_repair.md

### anti-entropy-unions-all-keys [IN] OBSERVATION
`anti_entropy_repair()` collects keys from all available nodes via set union, so a key present on any single node is propagated to all other nodes regardless of whether it was intentionally removed elsewhere
- Source: entries/2026/05/29/topic-leaderless-deletion-gap.md

### apfs-masks-linux-fsync-bugs [IN] OBSERVATION
macOS APFS provides implicit rename durability through its CoW transaction model, so the missing directory fsync is a latent bug that only surfaces on Linux filesystems (ext4, XFS)
- Source: entries/2026/05/28/topic-directory-fsync-semantics.md

### append-batch-no-rollback [IN] OBSERVATION
If an exception occurs mid-batch in `append_batch` (e.g., disk I/O failure in `_persist_event`), already-appended events remain in `_events`, `_streams`, and on disk with no rollback — the process continues with inconsistent in-memory state
- Source: entries/2026/05/29/topic-event-sourcing-concurrency-model.md

### append-batch-shares-references [IN] OBSERVATION
Returned `Event` objects from `append_batch` are the same instances stored in `_events`; no defensive copy is made for either the event or its `data` dict, so caller mutation corrupts store state.
- Source: entries/2026/05/29/event-sourcing-store-event_store-append_batch.md

### append-batch-version-check-is-precondition [IN] OBSERVATION
`append_batch` checks `expected_version` once before the write loop; it does not re-check after each event, so the guard is a pre-condition gate that rejects the entire batch cleanly on mismatch, not a per-event invariant
- Source: entries/2026/05/29/topic-event-sourcing-concurrency-model.md

### apply-byzantine-is-outbound-only [IN] OBSERVATION
`_apply_byzantine` is applied exclusively to outgoing messages; incoming messages are never filtered through it, so a Byzantine node's internal state (prepared/committed sets) remains consistent even while it sends corrupted messages
- Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-_apply_byzantine.md

### apply-remote-change-advances-lamport-clock [IN] OBSERVATION
`apply_remote_change` always advances the node's Lamport clock to `max(local_clock, remote_ts) + 1` before any conflict resolution, ensuring subsequent local writes have causally-later timestamps than any received change.
- Source: entries/2026/05/29/multi-leader-replication-multi_leader-apply_remote_change.md

### apply-remote-change-no-merge-fn-guard [IN] OBSERVATION
`apply_remote_change` does not validate that `merge_fn` is non-None when strategy is `CUSTOM_MERGE`; passing `None` raises `TypeError` at call time rather than a descriptive error.
- Source: entries/2026/05/29/multi-leader-replication-multi_leader-apply_remote_change.md

### async-index-defers-mutations [IN] OBSERVATION
With `async_index=True`, `TermPartitionedDB` queues index operations in `_pending` instead of applying them immediately; `flush_index()` must be called to drain the queue, modeling asynchronous global index updates.
- Source: entries/2026/05/29/secondary-index-partitioning-secondary_index_partitioning.md

### auto-commit-not-default [IN] OBSERVATION
`Consumer` defaults to `auto_commit=False` (line 182), requiring explicit `commit()` calls; at-least-once semantics via auto-commit are opt-in, not the default behavior
- Source: entries/2026/05/29/topic-kafka-consumer-offset-semantics.md

### avro-block-encoding-zero-terminated [IN] OBSERVATION
Arrays and maps are encoded as a sequence of count-prefixed blocks terminated by a 0-count block; counts and elements use zigzag+varint via `write_long`/`read_long`
- Source: entries/2026/05/29/topic-avro-block-encoding.md

### avro-encoder-positive-counts-only [IN] OBSERVATION
The `AvroEncoder` always writes positive counts for array/map blocks, never negative counts with byte-size prefixes, trading skip efficiency for encoder simplicity
- Source: entries/2026/05/29/topic-avro-block-encoding.md

### avro-encoding-prioritizes-compactness-over-self-description [IN] OBSERVATION
Avro's binary encoding systematically prioritizes compactness over self-description: the format contains no type tags or field names (requiring both schemas for decoding), arrays use zero-terminated blocks for streaming, and negative block counts enable O(1) skipping — all optimizations that assume schema availability.

### avro-enum-default-fallback [IN] OBSERVATION
When a writer sends an enum symbol not in the reader's symbol list, the reader falls back to the `default` symbol declared in the reader schema rather than raising an error
- Source: entries/2026/05/29/avro-serializer-test_avro.md

### avro-int-bounds-enforced [IN] OBSERVATION
The Avro `int` type enforces 32-bit signed integer range ([-2^31, 2^31-1]) at encode time via `ValueError`, while `long` accepts values at least as large as 2^40
- Source: entries/2026/05/29/avro-serializer-test_avro_serializer.md

### avro-missing-reader-field-requires-default [IN] OBSERVATION
During record resolution, a reader field absent from the writer schema must have a default value; missing defaults cause SchemaCompatibilityError, making defaults the mechanism for forward/backward compatibility.
- Source: entries/2026/05/29/avro-serializer-avro_serializer.md

### avro-negative-count-carries-byte-size [IN] OBSERVATION
A negative block count `-N` in Avro's spec means N elements follow, preceded by a byte-size long that enables O(1) skipping of the entire block without parsing elements
- Source: entries/2026/05/29/topic-avro-block-encoding.md

### avro-no-default-sentinel [IN] OBSERVATION
_NO_DEFAULT is a sentinel object distinguishing "no default provided" from a None/null default; this matters because null is a valid Avro default value, and conflating the two would break canonical form computation and default-filling during resolution.
- Source: entries/2026/05/29/avro-serializer-avro_serializer.md

### avro-no-self-describing-format [IN] OBSERVATION
Avro binary encoding contains no type tags, field names, or schema metadata; decoding is impossible without the writer schema.
- Source: entries/2026/05/29/avro-serializer-avro_serializer.md

### avro-promotions-one-directional [IN] OBSERVATION
Type promotion during schema resolution is strictly one-directional (int→long→float→double, int→float, int→double, long→double); reverse promotion (e.g., long→int) is not supported and raises SchemaCompatibilityError.
- Source: entries/2026/05/29/avro-serializer-avro_serializer.md

### avro-record-fields-wire-order [IN] OBSERVATION
Record fields are encoded and decoded in declaration order of the writer schema with no field names on the wire; reordering fields in the schema definition changes the binary format and breaks existing data.
- Source: entries/2026/05/29/avro-serializer-avro_serializer.md

### avro-registry-4byte-schema-id [IN] OBSERVATION
SchemaRegistry prepends a 4-byte big-endian unsigned int schema ID (struct.pack('>I', id)) to encoded payloads, modeling Confluent's Schema Registry wire format; schema IDs are limited to [0, 2^32).
- Source: entries/2026/05/29/avro-serializer-avro_serializer.md

### avro-schema-canonical-equality [IN] OBSERVATION
`Schema("int")` and `Schema({"type": "int"})` are equal and hash-equal, enabling consistent set/dict usage regardless of whether the schema was constructed from a string or dict form
- Source: entries/2026/05/29/avro-serializer-test_avro_serializer.md

### avro-schema-error-vs-compatibility-error [IN] OBSERVATION
SchemaError is raised at parse time for structurally invalid definitions (a valid Schema object is guaranteed well-formed); SchemaCompatibilityError is raised during decode resolution or compatibility checking when two schemas cannot be reconciled.
- Source: entries/2026/05/29/avro-serializer-avro_serializer.md

### avro-skip-is-recursive-not-bytejump [IN] OBSERVATION
The decoder's `_skip` method discards unwanted fields by recursively walking the writer schema rather than using byte-size jumps, which works correctly but requires understanding the element schema
- Source: entries/2026/05/29/topic-avro-block-encoding.md

### avro-union-disambiguation [IN] OBSERVATION
When encoding a Python dict into a union containing both a record and a map type, the encoder resolves ambiguity by matching dict keys against record field names first, falling back to map only if no record matches
- Source: entries/2026/05/29/avro-serializer-test_avro_serializer.md

### avro-union-no-duplicate-types [IN] OBSERVATION
A union schema cannot contain two branches with the same type name (or record name); this constraint is enforced at schema parse time in Schema._parse.
- Source: entries/2026/05/29/avro-serializer-avro_serializer.md

### avro-writer-reader-dual-schema-resolution [IN] OBSERVATION
AvroDecoder always requires both a writer schema (to parse the wire bytes) and a reader schema (to shape the output); this dual-schema resolution is Avro's core mechanism for schema evolution.
- Source: entries/2026/05/29/avro-serializer-avro_serializer.md

### batch-append-not-crash-safe [IN] OBSERVATION
`append_batch` can leave a partial batch on disk if the process crashes mid-write, since events are individually appended to the NDJSON file without a transaction marker or write-ahead log
- Source: entries/2026/05/29/event-sourcing-store-event_store.md

### batch-mode-defers-fsync [IN] OBSERVATION
In `batch` sync mode, individual WAL appends do not fsync until `_write_count` reaches `_batch_sync_count` (default 100), leaving up to 99 records vulnerable to crash loss
- Source: entries/2026/05/29/topic-fsync-semantics-by-mode.md

### batch-notifies-after-all-stored [IN] OBSERVATION
`append_batch()` stores all events in the batch first, then notifies subscribers of each event sequentially, ensuring subscribers see a consistent state where all batch events exist
- Source: entries/2026/05/29/topic-live-projection-subscription-mechanism.md

### batch-write-count-not-reset-on-forced-sync [IN] OBSERVATION
The WAL `_write_count` counter only resets when the batch threshold triggers a sync (line 133), not when a forced sync occurs, which may cause counter drift between forced and threshold-triggered syncs
- Source: entries/2026/05/29/topic-fsync-semantics-by-mode.md

### binary-formats-rigid-across-entire-storage-stack [IN] OBSERVATION
The entire storage stack uses rigid binary formats that preclude both forward evolution and post-corruption recovery: WAL records are contiguously packed with no block alignment or version negotiation preventing resync after mid-file corruption, and SSTables lack per-entry checksums and efficient skip structures — neither layer can be upgraded in place or self-repaired after partial damage.

### bitcask-auto-compact-threshold [IN] OBSERVATION
When the number of frozen segments exceeds `auto_compact_threshold` (default 5), compaction is triggered automatically during `put()` operations
- Source: entries/2026/05/28/log-structured-hash-table-test_bitcask.md

### bitcask-compact-bypasses-write-record [IN] OBSERVATION
`compact()` re-serializes records directly to the output file rather than calling `_write_record`, duplicating the binary serialization logic in a separate code path
- Source: entries/2026/05/29/log-structured-hash-table-bitcask-_write_record.md

### bitcask-compact-not-crash-safe [IN] OBSERVATION
Hash-index-storage compaction is non-atomic with no error handling: a crash between deleting old files and completing the rewrite can leave the store in an unrecoverable state with no rollback mechanism.
- Source: entries/2026/05/28/hash-index-storage-bitcask-compact.md

### bitcask-compact-preserves-timestamps [IN] OBSERVATION
Compaction rewrites records with their original timestamps (not the time of compaction), preserving timestamp-based ordering across merges.
- Source: entries/2026/05/28/hash-index-storage-bitcask-compact.md

### bitcask-compact-skips-active-file-keys [IN] OBSERVATION
During compaction, keys whose `keydir` entry points to the active file are excluded from merge output; their immutable-file copies are treated as stale because the active file holds a newer value.
- Source: entries/2026/05/28/hash-index-storage-bitcask-compact.md

### bitcask-compact-writes-hint-files [IN] OBSERVATION
Compaction produces `.hint` files alongside each merged data file (including mid-compaction when a merged file hits `max_file_size`), enabling O(keys) index rebuilds instead of O(records) full scans on next startup.
- Source: entries/2026/05/28/hash-index-storage-bitcask-compact.md

### bitcask-compaction-not-crash-safe [IN] OBSERVATION
Compaction performs delete-then-rename without journaling; a crash mid-compaction can leave the store in a state that `_recover()` cannot reconstruct correctly
- Source: entries/2026/05/28/log-structured-hash-table-bitcask.md

### bitcask-compaction-preserves-observable-state [IN] OBSERVATION
After `compact()`, every key returns the same value as before compaction and deleted keys remain absent; compaction changes only physical layout, never logical state.
- Source: entries/2026/05/28/hash-index-storage-test_bitcask.md

### bitcask-compaction-renumbers-active [IN] OBSERVATION
After compaction, merged files get IDs above the old active file, and the active file is renamed to an even higher ID to maintain the monotonically increasing file ID invariant.
- Source: entries/2026/05/28/hash-index-storage-bitcask.md

### bitcask-crash-recovery-without-hints [IN] OBSERVATION
`BitcaskStore` can rebuild its in-memory index by scanning `.data` files alone when `.hint` files are missing, producing identical read results to a clean startup with hint files present.
- Source: entries/2026/05/28/hash-index-storage-test_bitcask.md

### bitcask-crc-covers-payload-only [IN] OBSERVATION
CRC32 is computed over `key_bytes + value_bytes` only; the 12-byte header (which contains the CRC itself) is excluded from the checksum input
- Source: entries/2026/05/28/log-structured-hash-table-bitcask.md

### bitcask-crc-validated-on-read-only [IN] OBSERVATION
In the log-structured hash table, CRC32 validation occurs only during `get()` and `_scan_segment()`; writes compute and store the CRC but never verify it back, so a corrupted write is not caught until read time
- Source: entries/2026/05/29/topic-bitcask-binary-format.md

### bitcask-crc32-raises-corruption-error [IN] OBSERVATION
Reading a record whose payload does not match its CRC32 header raises `CorruptionError` — the store never silently returns corrupt data
- Source: entries/2026/05/29/log-structured-hash-table-tester_test_bitcask.md

### bitcask-delete-nonexistent-is-noop [IN] OBSERVATION
Calling `delete()` on a key that doesn't exist in the store completes without error.
- Source: entries/2026/05/28/hash-index-storage-test_bitcask.md

### bitcask-file-rotation-transparent [IN] OBSERVATION
File rotation triggered by `max_file_size` is invisible to readers — all keys remain retrievable across multiple `.data` files because the keydir maps each key to its specific `(file_id, offset)`.
- Source: entries/2026/05/29/hash-index-storage-tester_test_bitcask.md

### bitcask-fsync-per-record-default [IN] OBSERVATION
`sync_writes` defaults to `True`, meaning every `_write_record` call triggers an `fsync` — durable by default at significant write throughput cost.
- Source: entries/2026/05/28/hash-index-storage-bitcask.md

### bitcask-get-missing-returns-none [IN] OBSERVATION
`BitcaskStore.get()` returns `None` for missing or deleted keys, never raises `KeyError`.
- Source: entries/2026/05/28/hash-index-storage-test_bitcask.md

### bitcask-get-no-file-existence-guard [IN] OBSERVATION
If a segment file is deleted out-of-band or by a crash during compaction, `get()` raises an unhandled `FileNotFoundError` rather than returning `None` or a graceful error
- Source: entries/2026/05/29/log-structured-hash-table-bitcask-get.md

### bitcask-get-opens-fresh-handle [IN] OBSERVATION
`get()` opens a new file handle per call rather than using the `_file_handles` cache, making the handle cache effectively unused for reads
- Source: entries/2026/05/28/log-structured-hash-table-bitcask.md

### bitcask-get-single-disk-seek [IN] OBSERVATION
`get()` performs exactly one disk seek per call; the key-to-offset mapping is resolved entirely in memory via `self._index`, making reads O(1) index lookup plus one disk read
- Source: entries/2026/05/29/log-structured-hash-table-bitcask-get.md

### bitcask-header-is-12-bytes [IN] OBSERVATION
The log-structured hash table's `HEADER_FMT = "!III"` produces a fixed 12-byte header (three big-endian uint32s: CRC, key_size, value_size) at `log-structured-hash-table/bitcask.py:10-11`; total record size is always `12 + len(key) + len(value)`
- Source: entries/2026/05/29/topic-bitcask-binary-format.md

### bitcask-hint-files-exclude-tombstones [IN] OBSERVATION
`create_hint_files()` skips tombstone records and only emits entries whose index currently points to that segment+offset, so hint-based recovery cannot distinguish "key was deleted" from "key never existed in this segment"
- Source: entries/2026/05/28/log-structured-hash-table-bitcask.md

### bitcask-hint-files-skip-scan [IN] OBSERVATION
When a `.hint` file exists for a file ID, `_rebuild_index` loads it instead of scanning the data file; hint files have no checksum validation, so they must be written atomically with their data files during compaction.
- Source: entries/2026/05/28/hash-index-storage-bitcask.md

### bitcask-keydir-is-sole-index [IN] OBSERVATION
The in-memory `keydir` dict is the only index; every live key has exactly one `KeyEntry` pointing to its most recent non-tombstone record on disk.
- Source: entries/2026/05/28/hash-index-storage-bitcask.md

### bitcask-no-checksum-validation [IN] OBSERVATION
Records have no CRC or integrity checksum; the only corruption guard is the `assert read_key == key` in `get()`, which catches index/data mismatches but not bit-rot.
- Source: entries/2026/05/28/hash-index-storage-bitcask.md

### bitcask-no-mmap [IN] OBSERVATION
Neither Bitcask implementation uses memory-mapped I/O; both use standard `open()`/`read()` with manual buffering, forgoing kernel page cache optimizations that production Bitcask implementations rely on for startup performance
- Source: entries/2026/05/29/topic-bitcask-startup-cost.md

### bitcask-no-parallel-rebuild [IN] OBSERVATION
Both Bitcask implementations process data/hint files strictly sequentially during keydir rebuild with no threading, multiprocessing, or async I/O, even though hint file loading is embarrassingly parallel
- Source: entries/2026/05/29/topic-bitcask-startup-cost.md

### bitcask-no-range-query-support [IN] OBSERVATION
Neither Bitcask implementation has any range query, ordered iteration, or prefix scan capability; the hash-based keydir only supports exact-key point lookups
- Source: entries/2026/05/29/topic-bitcask-vs-lsm-tree.md

### bitcask-no-validation-on-segment-filenames [IN] OBSERVATION
If a file matching `segment_*.dat` contains non-numeric characters in the ID position, `_find_existing_segments` crashes with `ValueError`; no validation guards against malformed filenames
- Source: entries/2026/05/29/log-structured-hash-table-bitcask-_find_existing_segments.md

### bitcask-partial-write-safe [IN] OBSERVATION
On startup recovery, the store skips incomplete records at segment tail (header present but payload truncated) without raising errors or losing previously committed data
- Source: entries/2026/05/28/log-structured-hash-table-test_bitcask.md

### bitcask-reader-not-threadsafe [IN] OBSERVATION
File handles in `self.file_handles` are shared and mutable (via `seek`), so concurrent calls to `_read_record` on the same `file_id` would race.
- Source: entries/2026/05/29/hash-index-storage-bitcask-_read_record.md

### bitcask-rebuild-sorts-ascending [IN] OBSERVATION
`_rebuild_index` sorts file IDs ascending before scanning so that newer records overwrite older ones in the keydir, enforcing last-write-wins semantics.
- Source: entries/2026/05/28/hash-index-storage-bitcask-_rebuild_index.md

### bitcask-record-format-symmetry [IN] OBSERVATION
`_read_record` and `_write_record` share the same binary layout (`<dII` header + key bytes + value bytes); a change to either must be mirrored in the other.
- Source: entries/2026/05/29/hash-index-storage-bitcask-_read_record.md

### bitcask-recovery-order-is-ascending [IN] OBSERVATION
Recovery scans segments in ascending ID order so that newer writes overwrite older index entries, and the highest-ID segment becomes the active segment for appending
- Source: entries/2026/05/28/log-structured-hash-table-bitcask.md

### bitcask-rename-is-rotation-not-safety [IN] OBSERVATION
The `os.rename` calls in both Bitcask implementations rename active segments to frozen segments for namespace management, not for crash-safe atomic file creation — the file is already populated at its original path before the rename
- Source: entries/2026/05/29/topic-atomic-write-via-temp-rename.md

### bitcask-rotation-is-soft-cap [IN] OBSERVATION
File rotation is checked before each write, so the active file can exceed `max_file_size` by up to one record's worth of bytes; the limit is a soft cap, not a hard one
- Source: entries/2026/05/29/hash-index-storage-bitcask-_maybe_rotate.md

### bitcask-rotation-preserves-read-handles [IN] OBSERVATION
When the active file is rotated, its read handle in `file_handles` is preserved so existing `KeyEntry` references pointing to the old file remain valid
- Source: entries/2026/05/29/hash-index-storage-bitcask-_maybe_rotate.md

### bitcask-scan-short-payload-undetected [IN] OBSERVATION
In `_scan_data_file`, a truncated header safely breaks the scan loop, but a short key/value read after a valid header is not detected — the truncated data is silently used, potentially corrupting the keydir.
- Source: entries/2026/05/29/hash-index-storage-bitcask-_scan_data_file.md

### bitcask-segment-discovery-is-filesystem-based [IN] OBSERVATION
`_find_existing_segments` uses `os.listdir` as the source of truth for which segments exist; there is no persistent manifest file — the filesystem is the manifest
- Source: entries/2026/05/29/log-structured-hash-table-bitcask-_find_existing_segments.md

### bitcask-segment-naming-must-sync [IN] OBSERVATION
The filename parsing in `_find_existing_segments` (prefix/suffix slicing to extract segment ID) must stay in sync with the format string in `_segment_path`; no shared constant enforces this coupling
- Source: entries/2026/05/29/log-structured-hash-table-bitcask-_find_existing_segments.md

### bitcask-sequential-file-ids [IN] OBSERVATION
File IDs are assigned by incrementing `active_file_id` by 1 with no gap detection or collision check; compaction updates `active_file_id` to avoid collisions after renumbering
- Source: entries/2026/05/29/hash-index-storage-bitcask-_maybe_rotate.md

### bitcask-single-active-writer [IN] OBSERVATION
Exactly one data file is open for writes at any time; all other files are immutable and read-only.
- Source: entries/2026/05/28/hash-index-storage-bitcask.md

### bitcask-tests-disable-sync [IN] OBSERVATION
All tests in `test_bitcask.py` pass `sync_writes=False`, meaning the durable fsync-per-write code path is untested.
- Source: entries/2026/05/28/hash-index-storage-test_bitcask.md

### bitcask-tombstone-handling-asymmetry [IN] OBSERVATION
`_scan_data_file` removes tombstoned keys from `keydir` during rebuild, but `_load_hint_file` unconditionally inserts entries — it relies on compaction having already stripped tombstones before writing the hint file.
- Source: entries/2026/05/28/hash-index-storage-bitcask-_rebuild_index.md

### bitcask-tombstone-invisible-to-get [IN] OBSERVATION
`get()` never encounters tombstone records because `delete()` removes the key from `self._index`; tombstone handling is solely a recovery and compaction concern
- Source: entries/2026/05/29/log-structured-hash-table-bitcask-get.md

### bitcask-tombstone-is-empty-string [IN] OBSERVATION
Deletion is represented by writing a record with an empty-string value; both `_scan_data_file` (checks `val_size == 0`) and `get` (checks `value == ""`) use this convention.
- Source: entries/2026/05/28/hash-index-storage-bitcask.md

### bitcask-tombstone-semantics [IN] OBSERVATION
`BitcaskStore.delete()` writes a tombstone record (`b"__BITCASK_TOMBSTONE__"`); `get()` returns `None` for tombstoned keys, and `compact()` removes both the original entry and the tombstone
- Source: entries/2026/05/28/log-structured-hash-table-test_bitcask.md

### bitcask-write-record-no-index-update [IN] OBSERVATION
`_write_record` only appends to disk and returns a byte offset; it never modifies `self._index`, leaving index management entirely to callers (`put`, `delete`, `compact`)
- Source: entries/2026/05/29/log-structured-hash-table-bitcask-_write_record.md

### bloom-and-counting-share-hash-function [IN] OBSERVATION
BloomFilter and CountingBloomFilter use the identical `_hashes()` double-hashing scheme (MD5-based, lines 9-14), producing the same bit/counter positions for identical parameters — so a BloomFilter and CountingBloomFilter with the same `m` and `k` will agree on membership queries for the same insertions.
- Source: entries/2026/05/29/topic-cuckoo-filters.md

### bloom-compaction-rebuild-not-patch [IN] OBSERVATION
For SSTable compaction, rebuilding a fresh `BloomFilter` during the merge write pass is simpler and more correct than incrementally patching a `CountingBloomFilter` with `remove()`; compaction already pays O(N) I/O so filter construction adds only constant-factor overhead
- Source: entries/2026/05/28/topic-counting-bloom-compaction-interaction.md

### bloom-double-hashing [IN] OBSERVATION
`_hashes()` uses Kirschner-Mitzenmacher double hashing from a single MD5 digest, producing k positions as `(h1 + i*h2) % m` without k independent hash functions
- Source: entries/2026/05/28/topic-bloom-filter-integration.md

### bloom-filter-deterministic-hashing [IN] OBSERVATION
The Bloom filter hash function is deterministic (not seeded with randomness), so two filters with identical parameters produce identical bit arrays for identical inputs — enabling equality comparison and reproducible serialization
- Source: entries/2026/05/29/bloom-filter-test_bloom_filter.md

### bloom-filter-double-hashing [IN] OBSERVATION
`BloomFilter` generates k bit positions via Kirschner-Mitzenmacker double hashing: one MD5 digest is split into two 64-bit values h1 and h2, then positions are computed as `(h1 + i*h2) % m` rather than using k independent hash functions.
- Source: entries/2026/05/29/bloom-filter-bloom_filter-BloomFilter.md

### bloom-filter-len-counts-insertions [IN] OBSERVATION
`BloomFilter.__len__` counts total `add()` calls, not distinct items — adding the same item twice yields `len() == 2`, unlike `CountingBloomFilter` which has the same property tracked separately
- Source: entries/2026/05/29/bloom-filter-test_bloom_filter.md

### bloom-filter-monotonic-bits [IN] OBSERVATION
Bits in `BloomFilter` are only ever set, never cleared; this monotonic property makes the structure append-only, enables union via bitwise OR, and prevents deletion without a counting layer.
- Source: entries/2026/05/29/bloom-filter-bloom_filter-BloomFilter.md

### bloom-filter-no-false-negatives [IN] OBSERVATION
After `BloomFilter.add(x)`, `x in filter` always returns `True`; the data structure guarantees zero false negatives by design (all k bit positions are set on add and all must be set for membership).
- Source: entries/2026/05/29/bloom-filter-bloom_filter-BloomFilter.md

### bloom-filter-not-integrated [IN] OBSERVATION
Neither LSM tree implementation (`lsm.py` or `sstable.py`) references or uses the bloom filter module; they exist as independent DDIA concept demonstrations with zero cross-module imports
- Source: entries/2026/05/28/topic-bloom-filter-integration.md

### bloom-filter-optimal-sizing-textbook [IN] OBSERVATION
`BloomFilter` computes `bit_count` as `ceil(-n * ln(p) / ln(2)^2)` and `hash_count` as `round((m/n) * ln(2))`, matching the standard optimal formulas from Bloom filter literature
- Source: entries/2026/05/29/bloom-filter-test_bloom_filter.md

### bloom-filter-union-count-overestimates [IN] OBSERVATION
`BloomFilter.union()` sets `_count = self._count + other._count`, which overstates distinct items when filters share elements; `estimate_count()` from bit density is more accurate for unioned filters
- Source: entries/2026/05/29/bloom-filter-bloom_filter.md

### bloom-filter-union-requires-identical-params [IN] OBSERVATION
`BloomFilter.union()` raises `ValueError` if the two filters differ in bit array size (m) or hash count (k), since bitwise OR is only meaningful when bit positions map to the same hash space.
- Source: entries/2026/05/29/bloom-filter-bloom_filter-BloomFilter.md

### bloom-hashes-no-m-zero-guard [IN] OBSERVATION
`_hashes` does not guard against `m=0`, which causes an unhandled `ZeroDivisionError` in the `(h1 + i * h2) % m` computation
- Source: entries/2026/05/29/bloom-filter-bloom_filter-_hashes.md

### bloom-hashes-positions-not-unique [IN] OBSERVATION
The `k` positions returned by `_hashes` are not guaranteed distinct — duplicate positions can occur when `m` is small relative to `k`, and all callers (add, contains, remove) tolerate this via idempotent bit/counter operations
- Source: entries/2026/05/29/bloom-filter-bloom_filter-_hashes.md

### bloom-md5-split-64bit [IN] OBSERVATION
`_hashes` splits the 128-bit MD5 digest at the midpoint into two 64-bit little-endian unsigned integers `h1` and `h2`, used as base position and step size for double hashing
- Source: entries/2026/05/29/bloom-filter-bloom_filter-_hashes.md

### bloom-serialization-footer-ready [IN] OBSERVATION
`BloomFilter.to_bytes` packs a 12-byte header (`m`, `k`, `count` as three little-endian uint32s) followed by the raw bit array, matching the pattern used by SSTable footers for sparse indices
- Source: entries/2026/05/28/topic-bloom-filter-integration.md

### both-storage-paradigms-hit-scalability-walls [IN] OBSERVATION
Both storage paradigms in the reference implementations exhibit fundamental scalability constraints: the hash index requires all keys in RAM (making dataset size directly bound by available memory with no spill-to-disk fallback), while the LSM tree scans every SSTable on negative lookups because the correctly-implemented Bloom filter module is never wired into the read path.

### both-suites-discoverable-by-pytest [IN] OBSERVATION
Both `tester_test_*.py` and `test_*.py` naming conventions match pytest's default collection patterns (`test_*.py` and `*_test_*.py`), so `pytest` in any project directory runs the combined suite — **note: this contradicts existing belief `tester-naming-avoids-pytest-discovery`; one should be retracted**
- Source: entries/2026/05/29/topic-tester-test-pattern.md

### broadcast-hash-join-requires-small-side-in-memory [IN] OBSERVATION
`BroadcastHashJoin` loads the entire small dataset into a hash table at construction time via `_build_hash_table`, so the small side must fit in a single mapper's memory
- Source: entries/2026/05/29/topic-ddia-ch10-reduce-side-joins.md

### broadcast-join-loads-small-side-at-construction [IN] OBSERVATION
`BroadcastHashJoin` receives the small dataset at construction time and builds a hash table; the large dataset is streamed through `.join()`, reflecting the asymmetric API where the small side must be available upfront
- Source: entries/2026/05/29/map-side-join-test_map_side_joins.md

### btree-allocate-page-outside-wal [IN] OBSERVATION
Page allocation during splits modifies metadata through `PageManager.write_meta` (not WAL-logged), creating a potential crash-safety gap if the process fails between allocation and the subsequent WAL commit
- Source: entries/2026/05/28/b-tree-storage-engine-btree-_insert.md

### btree-bisect-left-for-leaf-exact-match [IN] OBSERVATION
Leaf lookups use `bisect_left` to find the leftmost insertion point, then check `keys[idx] == key` for exact match; `bisect_right` would return the position after all equal keys, overshooting the target.
- Source: entries/2026/05/29/topic-bisect-right-vs-left-in-btrees.md

### btree-bisect-left-leaf-bisect-right-internal [IN] OBSERVATION
`_search` uses `bisect_left` at leaf nodes for exact-match lookup and `bisect_right` at internal nodes for child routing; this asymmetry matches the split convention where the median key is copied up and retained as the first key of the right leaf.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-_search.md

### btree-counters-reset-per-operation [IN] OBSERVATION
`pm.reset_counters()` is called at the start of each public method, so `stats.pages_read/written` reflects only the most recent operation.
- Source: entries/2026/05/28/b-tree-storage-engine-btree.md

### btree-crc32-skips-corrupt-entries [IN] OBSERVATION
Corrupted WAL entries (CRC mismatch) are silently skipped during B-tree recovery rather than raising an error or halting replay.
- Source: entries/2026/05/28/b-tree-storage-engine-test_btree.md

### btree-delete-causes-monotonic-degradation [IN] OBSERVATION
B-tree deletion is structurally degrading: empty leaf pages are never freed, freed pages leave dangling parent pointers, and tree height only increases — causing monotonic space amplification and search-path lengthening over delete-heavy workloads.

### btree-delete-cleanup-depth2-only [IN] OBSERVATION
Empty leaf cleanup only triggers when the current internal node is at depth 2 (direct parent of leaves) and the empty child is not the leftmost; empty leaves at depth > 2 or at index 0 are silently left in place
- Source: entries/2026/05/28/b-tree-storage-engine-btree-_delete.md

### btree-delete-is-immediate-not-tombstone [IN] OBSERVATION
Deletes decrement `total_keys` immediately and the key becomes unreadable at once; no tombstone entries are written and no deferred compaction pass is needed
- Source: entries/2026/05/28/topic-btree-split-and-merge-strategy.md

### btree-delete-leaks-pages [IN] OBSERVATION
`BTree._delete` removes keys from leaves but never calls `free_page` on empty leaves, causing unbounded page file growth despite `PageManager` having a free list mechanism.
- Source: entries/2026/05/28/b-tree-storage-engine-fix-plan.md

### btree-delete-metadata-reread [IN] OBSERVATION
`delete` re-reads metadata after `_delete` returns to pick up `next_free`/`free_head` changes made by `pm.free_page`, rather than threading updated values through the recursive call stack.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-delete.md

### btree-delete-no-commit-on-miss [IN] OBSERVATION
When the key is not found, `delete` writes no WAL entries and performs no commit, making failed deletes purely read-only operations.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-delete.md

### btree-delete-skips-free-page [IN] OBSERVATION
The delete path does not call `free_page` even when a leaf becomes completely empty; `PageManager`'s free list is used only by `allocate_page` after explicit `free_page` calls from other paths.
- Source: entries/2026/05/28/topic-b-tree-underflow-rebalancing.md

### btree-delete-three-valued-return [IN] OBSERVATION
`_delete` returns `False` (not found), `True` (deleted, no cleanup needed), or the string `'empty'` (leaf now empty, parent should attempt cleanup); the public `delete()` collapses this to `bool`
- Source: entries/2026/05/28/b-tree-storage-engine-btree-_delete.md

### btree-depth-param-not-validated [IN] OBSERVATION
`_search` trusts that the `depth` parameter from metadata accurately reflects the tree's balance; corrupted height causes silent wrong-format deserialization (leaf data parsed as internal or vice versa) with no error.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-_search.md

### btree-deserialize-no-validation [IN] OBSERVATION
Both _deserialize_leaf and _deserialize_internal perform no validation of page type, sort order, or buffer length; they trust that WAL CRC checks and the serialize functions guarantee well-formed pages, so corruption produces silent wrong results rather than errors.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-_deserialize_leaf.md

### btree-double-fsync-per-mutation [IN] OBSERVATION
B-tree mutations pay for `os.fsync()` twice: once when writing the WAL entry (`btree.py:137`) and again when committing the page to the data file (`btree.py:105`), with the WAL truncated only after the data file sync confirms durability.
- Source: entries/2026/05/29/topic-fsync-vs-fdatasync.md

### btree-durability-protects-data-not-structure [IN] OBSERVATION
The B-tree's durability model protects user data but not structural integrity: mutations pay double fsync for data pages (WAL entry + data write) while structural metadata is never fsynced, AND structural integrity silently erodes during normal operation (leaked pages, ever-growing height, dangling parent pointers after free_page) with no defensive checks in the I/O layer to detect or prevent degradation.

### btree-empty-leaf-freed-on-delete [IN] OBSERVATION
When all keys are removed from a non-root leaf, the page is returned to the free list via `PageManager.free_page()` rather than persisting as an empty node
- Source: entries/2026/05/28/topic-btree-split-and-merge-strategy.md

### btree-file-never-shrinks [IN] OBSERVATION
The B-tree data file only grows (via `next_free` bump in `allocate_page`) and never shrinks; no truncate, compact, or defrag operation exists anywhere in the module.
- Source: entries/2026/05/29/topic-free-list-fragmentation.md

### btree-fix-plan-all-crash-safety [IN] OBSERVATION
All six bugs documented in `fix-plan.md` concern single-writer WAL/fsync crash safety; none involve concurrent access races, illustrating that even without concurrency the WAL protocol has subtle correctness requirements
- Source: entries/2026/05/29/topic-latch-coupling-postgres.md

### btree-free-list-intrusive [IN] OBSERVATION
Freed pages store the "next free" pointer in their own body at `HEADER_SIZE` offset, forming an intrusive singly-linked list with no separate allocator data structure.
- Source: entries/2026/05/28/b-tree-storage-engine-btree.md

### btree-free-page-bypasses-wal [IN] OBSERVATION
`pm.free_page()` called during deletion writes directly to the data file without WAL logging, creating a crash-consistency gap if the process fails between the free and the WAL commit
- Source: entries/2026/05/28/b-tree-storage-engine-btree-_delete.md

### btree-free-page-no-parent-fixup [IN] OBSERVATION
`PageManager.free_page` adds a page to the free list without removing the parent's downlink to it, creating a dangling-pointer risk on crash
- Source: entries/2026/05/29/topic-postgres-nbtree-half-dead-state-machine.md

### btree-has-page-io-counters [IN] OBSERVATION
The B-tree's `PageManager` tracks `pages_read` and `pages_written` with a `reset_counters()` method (multiply by `page_size` for bytes), but WAL writes are not counted by any existing counter
- Source: entries/2026/05/29/topic-write-amplification-measurement.md

### btree-height-never-shrinks [IN] OBSERVATION
Since delete never merges internal nodes or demotes the root, tree height only increases (on splits) and never decreases, regardless of how many keys are deleted.
- Source: entries/2026/05/28/topic-b-tree-underflow-rebalancing.md

### btree-impl-asymmetric-split-vs-merge [IN] OBSERVATION
The reference implementation fully propagates splits upward through `_insert`'s return value but swallows the `'empty'` signal in `_delete` after one level of cleanup, making insertion structurally complete but deletion structurally incomplete
- Source: entries/2026/05/28/topic-postgres-nbtree-lazy-deletion.md

### btree-insert-no-wal-commit [IN] OBSERVATION
`_insert` never calls `wal.commit()`; the caller (`put`) is solely responsible for committing after all page writes and metadata updates complete, making the entire `put` atomic with respect to crash recovery
- Source: entries/2026/05/28/b-tree-storage-engine-btree-_insert.md

### btree-internal-split-promotes-key [IN] OBSERVATION
Internal node splits promote the midpoint key into the parent (removing it from the child), unlike leaf splits which copy it (the key remains in the right child); this asymmetry allows uniform `bisect_right` routing at all internal levels.
- Source: entries/2026/05/29/topic-bisect-right-vs-left-in-btrees.md

### btree-keys-decoded-as-utf8 [IN] OBSERVATION
Leaf deserialization always decodes keys as UTF-8 strings while values remain raw bytes; invalid UTF-8 in key bytes causes UnicodeDecodeError at read time, not write time (since serialize accepts both str and bytes).
- Source: entries/2026/05/29/b-tree-storage-engine-btree-_deserialize_leaf.md

### btree-leaf-sibling-chain [IN] OBSERVATION
Leaves store a `next_sibling` pointer forming a left-to-right chain terminated by `NO_SIBLING` (0xFFFFFFFF), enabling range scans without touching internal nodes.
- Source: entries/2026/05/28/b-tree-storage-engine-btree.md

### btree-leaf-sibling-chain-forward-only [IN] OBSERVATION
Leaf pages store a `next_sibling` forward pointer (`NO_SIBLING = 0xFFFFFFFF` sentinel) but no backward pointer, so unlinking a leaf from the sibling chain requires locating the predecessor via parent traversal rather than direct backlink.
- Source: entries/2026/05/29/topic-postgres-nbtree-vacuum.md

### btree-leaf-sibling-patched-on-remove [IN] OBSERVATION
When an empty leaf is removed during deletion, `_delete` patches the previous sibling's `next_sibling` pointer to splice the empty leaf out of the linked list used by `range_scan` and `__iter__`
- Source: entries/2026/05/28/b-tree-storage-engine-btree-_delete.md

### btree-leaf-split-copies-key [IN] OBSERVATION
Leaf splits copy the midpoint key into both the parent and the right leaf (key remains searchable in the leaf), while internal splits promote the midpoint key out of both children into the parent only
- Source: entries/2026/05/28/b-tree-storage-engine-btree-_insert.md

### btree-lookup-reads-bounded-by-height [IN] OBSERVATION
A single `BTree.get()` reads at most `height + 1` disk pages, verified via `pm.pages_read` after counter reset.
- Source: entries/2026/05/28/b-tree-storage-engine-test_btree.md

### btree-max-keys-forces-splits [IN] OBSERVATION
With `max_keys_per_page=4`, inserting a 5th key into a leaf triggers a page split; tests use this small fanout to exercise multi-level tree structure with few keys.
- Source: entries/2026/05/28/b-tree-storage-engine-test_btree.md

### btree-metadata-bypasses-wal [IN] OBSERVATION
`BTree._write_meta` writes root pointer and free list head directly to the data file without logging through the WAL, making metadata updates not crash-safe.
- Source: entries/2026/05/28/b-tree-storage-engine-fix-plan.md

### btree-mutation-fsync-is-asymmetric [IN] OBSERVATION
B-tree mutations pay double fsync costs for user data (WAL entry + data page) but skip fsync entirely for structural metadata, creating an asymmetry where key-value pairs survive crashes but the free-page list and allocation state may not.

### btree-no-underflow-rebalancing [IN] OBSERVATION
Delete does not rebalance underfilled nodes; the only cleanup is freeing empty non-leftmost leaves at depth 2 — a deliberate simplification over production B-trees.
- Source: entries/2026/05/28/b-tree-storage-engine-btree.md

### btree-page-alignment-isolates-corruption [IN] OBSERVATION
The B-tree's fixed-size pages (`page_num * page_size` addressing) provide natural resync boundaries for the data file — corruption of one page does not affect reads of other pages, unlike the streaming WAL formats
- Source: entries/2026/05/29/topic-length-prefix-framing-resilience.md

### btree-page-io-has-no-defensive-checks [IN] OBSERVATION
The B-tree's page I/O layer lacks defensive validation at every stage: deserialization accepts any bytes, leaf-finding doesn't verify node types, and page writes silently truncate oversized data — a single corrupted byte can cascade through the tree undetected.

### btree-page-overwrite-no-size-change [IN] OBSERVATION
B-tree PageManager overwrites pages at fixed offsets within a pre-allocated file, so most write+sync cycles do not change file size and would benefit from `fdatasync` skipping metadata I/O
- Source: entries/2026/05/29/topic-fdatasync-vs-fsync-tradeoffs.md

### btree-page-writes-unbuffered-by-app [IN] OBSERVATION
`PageManager` has no application-level page cache — every `read_page` reads from the file and every `write_page` writes immediately, relying entirely on the kernel page cache for performance
- Source: entries/2026/05/29/topic-o-direct-and-dio.md

### btree-page-zero-is-metadata [IN] OBSERVATION
Page 0 is always the metadata page (root_page, height, total_keys, next_free_page, free_list_head); the root page is never page 0.
- Source: entries/2026/05/28/b-tree-storage-engine-btree.md

### btree-partial-split-replay-orphans-right-page [IN] OBSERVATION
If a crash occurs after the right page and left page are WAL-logged but before the parent update, recovery produces a consistent sibling chain but the right page is unreachable from tree descent — range scans find the keys but point lookups silently miss them
- Source: entries/2026/05/29/topic-split-and-sibling-update-atomicity.md

### btree-put-is-upsert [IN] OBSERVATION
`BTree.put()` on an existing key updates the value in-place without incrementing `len(tree)` — it is an upsert, not a blind insert.
- Source: entries/2026/05/28/b-tree-storage-engine-tester_test_btree.md

### btree-put-metadata-reread [IN] OBSERVATION
`put` must re-read metadata after `_insert` returns because `PageManager.allocate_page` mutates `next_free` and `free_head` as a side effect during splits, and those changes are not threaded back through the return value.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-put.md

### btree-recursive-insert-returns-union [IN] OBSERVATION
`_insert` returns a discriminated union: `None` (updated existing), `'inserted'` (new key, no split), or `(mid_key, new_page)` (split happened); root splits are handled in `put()`, not inside the recursion.
- Source: entries/2026/05/28/b-tree-storage-engine-btree.md

### btree-sibling-chain-fixed-offset [IN] OBSERVATION
The `next_sibling` field sits at a fixed byte offset (bytes 3–6) in every serialized leaf page, making the sibling chain walkable without fully deserializing page contents
- Source: entries/2026/05/29/topic-split-during-insert.md

### btree-sibling-chain-maintained-across-mutations [IN] OBSERVATION
The B-tree sibling chain is actively maintained during all structural mutations with a fixed wire format: splits write the new right page with the old leaf's next_sibling pointer before rewriting the old leaf, deletions patch the previous sibling's pointer to splice out the removed leaf, and the next_sibling field sits at a fixed byte offset (bytes 3-6) enabling reliable traversal without full page deserialization.

### btree-sibling-pointers-unused-for-merge [IN] OBSERVATION
Leaf pages serialize a `next_sibling` pointer (via `_serialize_leaf` and `NO_SIBLING` sentinel) providing infrastructure for merge/redistribute, but no code path reads sibling pointers during deletion.
- Source: entries/2026/05/28/topic-b-tree-underflow-rebalancing.md

### btree-single-file-avoids-dir-fsync-gap [IN] OBSERVATION
The B-tree's PageManager writes to a single pre-existing data file opened at construction, so normal operations never create new files and avoid the directory fsync gap that affects segment-based engines during rotation and compaction.
- Source: entries/2026/05/29/topic-fsync-guarantees.md

### btree-single-threaded-assumption [IN] OBSERVATION
The B-tree implementation has no locking, latching, or concurrency control; all crash-safety gaps exist even without concurrent access
- Source: entries/2026/05/29/topic-postgres-nbtree-half-dead-state-machine.md

### btree-single-writer [IN] OBSERVATION
`put` (and `delete`) assume single-threaded access: metadata is re-read after `_insert`/`_delete` without any locking or compare-and-swap, creating a TOCTOU window under concurrent writers.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-put.md

### btree-split-write-order-protects-chain [IN] OBSERVATION
During a leaf split, `_insert` writes the new right page to the WAL before the modified left page, ensuring the sibling pointer target is durable before any pointer references it — so the chain never points to a nonexistent page after crash recovery
- Source: entries/2026/05/29/topic-split-and-sibling-update-atomicity.md

### btree-stdlib-only [IN] OBSERVATION
The B-tree module uses only stdlib imports (os, struct, zlib, bisect) with zero external dependencies.
- Source: entries/2026/05/28/b-tree-storage-engine-btree.md

### btree-structural-integrity-silently-erodes [IN] OBSERVATION
B-tree structural integrity silently erodes over time: deletions cause monotonic degradation (leaked pages, ever-growing height) while the I/O layer performs no validation to detect or report the resulting inconsistencies.

### btree-survives-close-reopen [IN] OBSERVATION
Data written before `BTree.close()` is fully recoverable by constructing a new `BTree` on the same directory, including updates and deletes.
- Source: entries/2026/05/28/b-tree-storage-engine-tester_test_btree.md

### btree-traversal-is-forward-only [IN] OBSERVATION
B-tree sequential access is strictly forward-only: iteration descends the left spine (following children at index zero at every internal level) to reach the leftmost leaf, then walks the forward-only sibling chain, with no backward pointer, reverse iterator, or random leaf access mechanism.

### btree-uses-bisect-not-linear-scan [IN] OBSERVATION
In-node key lookup uses `bisect_left`/`bisect_right` (binary search), making within-node search O(log B) rather than the textbook O(B) linear scan — shifting the optimal branching factor higher than classical B-tree analysis predicts.
- Source: entries/2026/05/29/topic-index-size-vs-scan-cost-benchmarking.md

### btree-wal-before-data [IN] OBSERVATION
Every page mutation is logged to the WAL (with fsync) before being written to the data file; recovery replays the WAL, so no committed write is lost on crash.
- Source: entries/2026/05/28/b-tree-storage-engine-btree.md

### btree-wal-checksum-excludes-page-num [IN] OBSERVATION
In `b-tree-storage-engine/btree.py`, the WAL entry checksum covers only `page_data`, so a corrupted `page_num` during recovery writes a valid page to the wrong disk location without detection — the most dangerous integrity gap in the codebase
- Source: entries/2026/05/29/topic-crc-coverage-audit.md

### btree-wal-entry-includes-full-page [IN] OBSERVATION
Each B-tree WAL entry is `16 + page_size` bytes (12-byte header with seq/page_num/data_len, full page image, 4-byte CRC32), making WAL writes the dominant cost for small key-value updates
- Source: entries/2026/05/29/topic-write-amplification-measurement.md

### btree-wal-fsync-per-entry [IN] OBSERVATION
Each `WAL.log_write()` call performs an `fsync`, guaranteeing that WAL entries are durable before any data page writes proceed
- Source: entries/2026/05/29/topic-split-atomicity-via-wal.md

### btree-wal-is-physical-redo-only [IN] OBSERVATION
The B-tree WAL in `btree.py` stores full after-image page data (physical WAL) and supports only redo recovery; there is no undo capability
- Source: entries/2026/05/29/topic-write-ahead-logging-fsync-ordering.md

### btree-wal-no-data-fsync [IN] OBSERVATION
`PageManager.write_page` calls `flush()` but never `os.fsync()`, so written pages may not be durable on crash even after the WAL has committed.
- Source: entries/2026/05/28/b-tree-storage-engine-fix-plan.md

### btree-wal-no-operation-boundaries [IN] OBSERVATION
The WAL logs individual page writes and commits by full truncation; it cannot represent a multi-step structural operation as an atomic unit
- Source: entries/2026/05/29/topic-postgres-nbtree-half-dead-state-machine.md

### btree-wal-provides-split-atomicity [IN] OBSERVATION
Multi-page operations (splits, deletes that free pages) are made crash-safe by writing all page modifications to the WAL before applying them to the data file
- Source: entries/2026/05/28/topic-btree-split-and-merge-strategy.md

### btree-wal-replay-is-idempotent [IN] OBSERVATION
Because the B-tree WAL stores complete page images, replaying entries already applied to the data file produces identical results, making the crash window after `sync()` but before `truncate()` harmless
- Source: entries/2026/05/29/topic-write-ahead-logging-fsync-ordering.md

### btree-wal-replays-without-commit [IN] OBSERVATION
The B-tree's WAL replays all valid (CRC-passing) entries on recovery regardless of whether a commit marker is present — uncommitted entries are applied without distinction.
- Source: entries/2026/05/28/b-tree-storage-engine-test_btree.md

### btree-wal-three-phase-commit [IN] OBSERVATION
The B-tree WAL commit follows a strict three-phase protocol: (1) write WAL entries + fsync, (2) fsync data file via `page_manager.sync()`, (3) truncate WAL + fsync WAL fd — crash-safe at every interleaving because WAL entries are physical page images making replay idempotent.
- Source: entries/2026/05/29/topic-crash-recovery-invariants.md

### btree-wal-truncate-before-data-sync [IN] OBSERVATION
`WAL.commit()` truncates the WAL before the data file is fsynced, creating a crash window where both the log and the data can be lost.
- Source: entries/2026/05/28/b-tree-storage-engine-fix-plan.md

### btree-wal-truncate-is-safe [IN] OBSERVATION
The B-tree's WAL `commit()` only truncates after fsyncing the data file, making truncation idempotent — a crash mid-truncation simply replays the WAL entries again on recovery
- Source: entries/2026/05/28/topic-truncation-crash-safety.md

### btree-wal-truncation-is-commit-point [IN] OBSERVATION
The WAL truncation in `commit()` is the atomic linearization point: before truncation, recovery will redo the transaction; after truncation, the transaction is committed
- Source: entries/2026/05/29/topic-split-atomicity-via-wal.md

### btree-wal-uses-full-page-images [IN] OBSERVATION
The WAL logs complete page images (physical logging) rather than logical operations, which eliminates incomplete-split states during recovery but increases write amplification compared to PostgreSQL's logical WAL entries.
- Source: entries/2026/05/29/topic-postgres-nbtree-readme.md

### btree-wal-uses-trailer-crc [IN] OBSERVATION
The B-tree WAL places its CRC32 checksum as a trailer after `page_data`, while the standalone WAL and Bitcask embed CRC in the record header — a structural difference that affects torn-write detection behavior
- Source: entries/2026/05/29/topic-crc-scope-comparison-across-implementations.md

### buffer-bounded-by-window-plus-lateness [IN] OBSERVATION
Join buffer size is bounded by events within `window.duration` of the current watermark; `_expire_events` garbage-collects everything below that cutoff on every event arrival or `advance_time` call
- Source: entries/2026/05/29/topic-watermark-vs-processing-time.md

### bully-assumes-full-connectivity [IN] OBSERVATION
The Bully Algorithm requires any node to message any other node directly via `all_node_ids`; there is no ring or tree overlay topology
- Source: entries/2026/05/29/topic-bully-vs-ring-election.md

### bully-candidate-half-timeout [IN] OBSERVATION
Candidates wait `election_timeout // 2` for ALIVE responses before deciding, shorter than the follower's full `election_timeout`, to avoid cascading election delays.
- Source: entries/2026/05/29/leader-election-leader_election.md

### bully-cascade-causes-quadratic-messages [IN] OBSERVATION
Receiving an ELECTION from a lower-ID node triggers both an ALIVE response and a new `start_election` call, creating the O(n²) message cascade in worst-case elections
- Source: entries/2026/05/29/topic-bully-vs-ring-election.md

### bully-coordinator-is-broadcast [IN] OBSERVATION
`declare_victory` sends a COORDINATOR message to every other node in the cluster (O(n) messages), not just to nodes that participated in the election
- Source: entries/2026/05/29/topic-bully-vs-ring-election.md

### bully-election-terms-monotonic [IN] OBSERVATION
Election terms in `get_election_history()` are enforced to be monotonically non-decreasing across successive elections, validated by `test_terms_increase_monotonically`
- Source: entries/2026/05/29/leader-election-test_leader_election.md

### bully-highest-id-wins [IN] OBSERVATION
The Bully Algorithm always elects the highest available node ID as leader; `start_election` only sends ELECTION to higher-ID nodes, and `declare_victory` only fires when no higher node responds.
- Source: entries/2026/05/29/leader-election-leader_election.md

### bully-recovery-triggers-election [IN] OBSERVATION
When a failed node recovers via `recover_node`, it immediately starts an election, potentially preempting the current leader — this is the defining "bully" behavior.
- Source: entries/2026/05/29/leader-election-leader_election.md

### bully-split-brain-resolved-post-hoc [IN] OBSERVATION
Split-brain is detected and resolved by `_resolve_split_brain` in the cluster harness after each tick (forcing lower-ID leaders to re-elect), not prevented by the node-level protocol alone.
- Source: entries/2026/05/29/leader-election-leader_election.md

### bully-synchronous-delivery [IN] OBSERVATION
All messages generated within a single `tick()` call are delivered and fully resolved (including cascading responses) in a `while all_messages` loop before the tick returns.
- Source: entries/2026/05/29/leader-election-leader_election.md

### bully-tests-use-deterministic-simulation [IN] OBSERVATION
Leader election tests control time via an explicit `start_time` parameter to `run_until_leader()` rather than real timers or sleeps, making the entire test suite deterministic with no flaky timing
- Source: entries/2026/05/29/leader-election-test_leader_election.md

### byzantine-mode-is-immutable-after-construction [IN] OBSERVATION
The `byzantine_mode` field is set in `__init__` and never modified during protocol execution, so a node's fault behavior cannot change mid-protocol — simplifying reasoning about safety guarantees
- Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-_apply_byzantine.md

### catch-up-then-subscribe [IN] OBSERVATION
`LiveProjection` implements the catch-up subscription pattern: replay historical events via `read_all(from_position=...)`, then switch to push-based updates for future events, bridging pull/push at a known position with no gap
- Source: entries/2026/05/29/topic-live-projection-subscription-mechanism.md

### catch-up-uses-event-id-filtering [IN] OBSERVATION
`catch_up()` calls `read_all(from_position=self._position + 1)` to skip already-processed events, but `_on_event` has no equivalent position guard — an asymmetry that enables duplicate processing
- Source: entries/2026/05/29/topic-catch-up-subscription-gap.md

### catch-up-uses-position-plus-one [IN] OBSERVATION
`Projection.catch_up()` calls `read_all(from_position=self._position + 1)`, so loading a snapshot sets the exact boundary — only events after the snapshot's position are replayed, making reconstruction cost proportional to events-since-snapshot
- Source: entries/2026/05/29/topic-event-sourcing-snapshots.md

### cbf-8x-memory-vs-standard [IN] OBSERVATION
`CountingBloomFilter` uses one byte per counter position (`bytearray(self._m)`) versus one bit per position in `BloomFilter`, an 8x memory overhead for deletion support
- Source: entries/2026/05/29/topic-memtable-bloom-filter-lifecycle.md

### cbf-count-tracks-calls-not-cardinality [IN] OBSERVATION
`CountingBloomFilter.__len__` returns `add` calls minus `remove` calls, not distinct element count; adding the same item twice and removing once leaves `len == 1`
- Source: entries/2026/05/29/bloom-filter-bloom_filter-CountingBloomFilter.md

### cbf-one-byte-per-counter [IN] OBSERVATION
Each `CountingBloomFilter` counter uses a full byte in a `bytearray` regardless of `counter_bits`, using 8x more memory than a bit-packed representation
- Source: entries/2026/05/29/bloom-filter-bloom_filter-CountingBloomFilter.md

### cbf-remove-can-introduce-false-negatives [IN] OBSERVATION
Unlike standard Bloom filters which never have false negatives, `CountingBloomFilter.remove()` can cause false negatives when two items share a hash position — removing one decrements the shared counter, potentially making the other appear absent
- Source: entries/2026/05/29/bloom-filter-bloom_filter-remove.md

### cbf-remove-false-accept [IN] OBSERVATION
`CountingBloomFilter.remove` can silently succeed for an item never added if all its hash positions have non-zero counters from other items, corrupting the filter state
- Source: entries/2026/05/29/bloom-filter-bloom_filter-CountingBloomFilter.md

### cbf-remove-raises-on-absent-item [IN] OBSERVATION
`CountingBloomFilter.remove()` raises `ValueError` if any hash position has a zero counter, providing a safety check against removing items that were never added
- Source: entries/2026/05/29/topic-memtable-bloom-filter-lifecycle.md

### cbf-remove-two-pass-atomicity [IN] OBSERVATION
`CountingBloomFilter.remove()` validates all hash positions have non-zero counters in a first pass before decrementing any in a second pass, preventing partial state corruption when removing an absent item
- Source: entries/2026/05/29/bloom-filter-bloom_filter-remove.md

### cbf-saturated-counters-are-permanent [IN] OBSERVATION
When a `CountingBloomFilter` counter reaches `_max_val` (default 15), it is never decremented, preventing false negatives at the cost of a slight false positive rate increase
- Source: entries/2026/05/29/topic-memtable-bloom-filter-lifecycle.md

### cbf-saturated-counters-never-decrement [IN] OBSERVATION
When a `CountingBloomFilter` counter reaches `_max_val` (default 15 for 4-bit counters), it is permanently frozen: neither `add` nor `remove` will change it
- Source: entries/2026/05/29/bloom-filter-bloom_filter-CountingBloomFilter.md

### cdc-and-event-sourcing-share-projection-pattern [IN] OBSERVATION
`CDCConsumer`/`MaterializedView` in `cdc.py` and `Projection` in `event_store.py` implement the same structural pattern — track position, pull new events, apply handlers — against different event sources (database mutations vs. domain events)
- Source: entries/2026/05/29/topic-cqrs-read-models.md

### cdc-backbone-both-heuristic-and-insufficient [IN] OBSERVATION
The CDC backbone that all derived systems depend on has two independent reliability gaps: event type semantics are determined by reconstruction heuristics rather than explicit markers (insert vs update distinguished by old_value presence, tombstones reported as None, snapshots use sentinel sequence numbers), AND the consistency requirements of derived systems (explicit flush to make mutations visible, old values for index maintenance) aren't reliably met by the CDC infrastructure.

### cdc-before-after-contract [IN] OBSERVATION
Every `ChangeEvent` follows a strict before/after convention: INSERT has `before=None`, DELETE has `after=None`, UPDATE has both populated as full row copies, making events self-contained without querying the source database.
- Source: entries/2026/05/29/change-data-capture-cdc.md

### cdc-consumer-poll-is-at-least-once [IN] OBSERVATION
If a CDC consumer crashes mid-`poll()`, the position is not advanced, so all events from that batch will be reprocessed on the next call — providing at-least-once delivery with no deduplication.
- Source: entries/2026/05/29/change-data-capture-cdc.md

### cdc-consumer-position-is-next-sequence [IN] OBSERVATION
A CDC consumer's `_position` tracks the *next* sequence number to read (not the last one read); after processing, it advances to `last_event.sequence_number + 1`.
- Source: entries/2026/05/29/change-data-capture-cdc.md

### cdc-consumer-position-is-volatile [IN] OBSERVATION
CDCConsumer tracks its read position as an in-memory integer (`_position`) with no durable offset storage, meaning position is lost on process restart
- Source: entries/2026/05/29/topic-cdc-flush-semantics.md

### cdc-determines-insert-vs-update-by-old-value [IN] OBSERVATION
The CDC layer distinguishes `insert` from `update` events by checking whether `old_value is None`; the WAL only knows `PUT` and `DELETE`, so the semantic enrichment happens at the CDC boundary
- Source: entries/2026/05/29/topic-ddia-unbundling-concept.md

### cdc-event-semantics-depend-on-reconstruction-heuristics [IN] OBSERVATION
CDC event semantics are determined by reconstruction heuristics rather than explicit markers: insert vs update is distinguished by checking whether old_value is None, tombstones are reported as None values indistinguishable from missing data in conflict records, and snapshots use a sentinel sequence number (-1) to distinguish bootstrapped state from real events.

### cdc-insert-update-distinction [IN] OBSERVATION
Insert vs. update is determined solely by whether `CDCStream.emit()` receives a non-None `old_value`; the WAL itself only records `PUT` operations for both cases
- Source: entries/2026/05/29/unbundled-database-unbundled_database.md

### cdc-log-compact-keeps-latest-per-key [IN] OBSERVATION
CDCLog.compact() retains only the most recent event per (table, key) pair, matching Kafka log compaction semantics but without concurrent-reader safety
- Source: entries/2026/05/29/topic-cdc-flush-semantics.md

### cdc-log-sequence-numbers-survive-compaction [IN] OBSERVATION
`CDCLog.compact()` preserves the original sequence numbers of surviving events; it does not renumber them, so consumer positions remain valid references after compaction.
- Source: entries/2026/05/29/change-data-capture-cdc.md

### cdc-materialized-view-transform-none-deletes [IN] OBSERVATION
When a `MaterializedView`'s transform function returns `None` for an INSERT or UPDATE event, the row is removed from the view rather than stored, acting as a filter on the derived table.
- Source: entries/2026/05/29/change-data-capture-cdc.md

### cdc-old-value-required-for-index-consistency [IN] OBSERVATION
`SecondaryIndex.process_event` depends on `CDCEvent.old_value` to remove stale index entries during updates; without before-images, incremental index maintenance produces phantom references
- Source: entries/2026/05/29/topic-ddia-unbundling-concept.md

### cdc-search-index-full-reindex-on-update [IN] OBSERVATION
`SearchIndex` removes all old tokens and re-adds all new tokens on every UPDATE event, even if only non-indexed columns changed — correct but not optimized for partial changes.
- Source: entries/2026/05/29/change-data-capture-cdc.md

### cdc-snapshot-uses-sentinel-sequence-number [IN] OBSERVATION
`create_snapshot` assigns `sequence_number=-1` to all synthetic INSERT events, distinguishing bootstrapped state from real log entries and enabling new consumers to load current state then seek to the live tail.
- Source: entries/2026/05/29/change-data-capture-cdc.md

### cdc-write-and-log-are-synchronously-coupled [IN] OBSERVATION
Every CDCDatabase mutation (insert/update/delete) writes to in-memory state and appends to CDCLog in the same synchronous call with no failure boundary between them
- Source: entries/2026/05/29/topic-cdc-flush-semantics.md

### ch-md5-masked-to-32-bits [IN] OBSERVATION
The consistent hash ring's `_hash` function truncates MD5's 128-bit output to 32 bits via `& 0xFFFFFFFF`, matching the `RING_SIZE` of 2^32
- Source: entries/2026/05/29/consistent-hashing-consistent_hashing.md

### ch-preference-list-skips-duplicates [IN] OBSERVATION
`get_nodes` walks clockwise past virtual nodes belonging to already-seen physical nodes, ensuring exactly `replication_factor` distinct physical nodes in the preference list
- Source: entries/2026/05/29/consistent-hashing-consistent_hashing.md

### ch-ring-sorted-invariant [IN] OBSERVATION
`ConsistentHashRing._ring_positions` is maintained in sorted order at all times via `bisect`; all key lookups depend on this invariant
- Source: entries/2026/05/29/consistent-hashing-consistent_hashing.md

### ch-transfer-maps-are-descriptive [IN] OBSERVATION
`add_node` and `remove_node` return transfer map descriptions of which arcs move between nodes but do not move data themselves; the caller must act on the map
- Source: entries/2026/05/29/consistent-hashing-consistent_hashing.md

### ch-vnode-count-is-weighted [IN] OBSERVATION
A consistent hash ring node with weight `w` gets `int(num_vnodes * w)` virtual nodes, so weight=0.5 produces half the default ring presence
- Source: entries/2026/05/29/consistent-hashing-consistent_hashing.md

### chained-checksums-trade-random-access-for-log-integrity [IN] OBSERVATION
Chained frame checksums (as in SQLite's WAL) prevent mid-log splicing, deletion, and reordering but require sequential validation from the chain start, eliminating the ability to validate or replay a single record in isolation
- Source: entries/2026/05/29/topic-cumulative-checksums.md

### checkpoint-empty-payload [IN] OBSERVATION
Checkpoint records are encoded with zero-length key and value fields; the record's semantic role is carried entirely by its `OP_CHECKPOINT` type and sequence number
- Source: entries/2026/05/29/write-ahead-log-wal-checkpoint.md

### checkpoint-filtered-from-replay [IN] OBSERVATION
`replay()` filters out CHECKPOINT records (along with COMMIT records), returning only PUT and DELETE — checkpoint records serve as truncation boundaries, not data operations visible to recovery consumers
- Source: entries/2026/05/29/write-ahead-log-wal-checkpoint.md

### checkpoint-record-not-used-in-recovery [IN] OBSERVATION
`OP_CHECKPOINT` records are written to the WAL but recovery (`_recover_seq_num`) does not search for them; the caller must supply the checkpoint sequence number externally via `replay(after_seq=)`
- Source: entries/2026/05/29/topic-log-compaction-vs-persistence.md

### checkpoint-record-unused-for-truncation [IN] OBSERVATION
The WAL supports `OP_CHECKPOINT` records that could serve as a durable truncation watermark (skip records below the watermark during replay), but `truncate()` physically removes records instead of using this mechanism
- Source: entries/2026/05/29/topic-crash-safety-of-truncate.md

### checksum-mask-is-python2-compat [IN] OBSERVATION
The `& 0xFFFFFFFF` mask in `WAL._checksum` is a Python 2 portability idiom — Python 3's `zlib.crc32` already returns unsigned values, making the mask technically redundant but kept as defensive code
- Source: entries/2026/05/29/b-tree-storage-engine-btree-_checksum.md

### client-holds-stale-token-after-lock-expiry [IN] OBSERVATION
The `Client._held_tokens` dict is never cleared on lock expiry — the client retains and may use a token whose corresponding lock has already been acquired by another client
- Source: entries/2026/05/29/topic-ddia-ch8-process-pauses.md

### code-expert-has-no-codegen [IN] OBSERVATION
The code-expert workflow is read-only against the target repository — it scans, explores, and extracts beliefs but never generates or modifies source files
- Source: entries/2026/05/29/topic-code-expert-generation-pipeline.md

### codebase-uses-no-steal-policy [IN] OBSERVATION
All implementations use NO-STEAL buffer management: uncommitted transaction data is never written to the persistent data store, eliminating the need for undo logging at the cost of requiring all dirty data from active transactions to fit in memory
- Source: entries/2026/05/29/topic-undo-logging-and-steal-policy.md

### combiner-does-not-affect-map-output-stats [IN] OBSERVATION
`map_output_records` is incremented before the combiner runs (line 108), so stats reflect pre-combiner volume; the test at line 68 asserts this explicitly
- Source: entries/2026/05/29/topic-combiner-correctness.md

### combiner-not-type-checked [IN] OBSERVATION
The combiner's callable signature matches the reducer's, but no runtime or static check verifies associativity or commutativity — a non-associative combiner (e.g., average) produces silently wrong results that vary with `num_mappers`
- Source: entries/2026/05/29/topic-combiner-correctness.md

### combiner-output-indistinguishable-from-mapper-output [IN] OBSERVATION
The reducer reads combiner output from intermediate JSON files identically to raw mapper output, with no metadata distinguishing pre-aggregated values from raw emitted pairs
- Source: entries/2026/05/29/topic-combiner-correctness.md

### commit-aborts-on-conflict [IN] OBSERVATION
On write-write conflict, `commit()` calls `abort(tx)` internally and returns `False`, so callers never need to manually abort after a failed commit; the transaction is dead after a `False` return.
- Source: entries/2026/05/29/snapshot-isolation-mvcc_database-commit.md

### commit-conflict-check-not-atomic [IN] OBSERVATION
The conflict-check-then-commit sequence in `MVCCDatabase.commit()` is not atomic — no locking protects the window between checking versions and marking committed, requiring single-threaded execution for correctness.
- Source: entries/2026/05/29/snapshot-isolation-mvcc_database-commit.md

### commit-conflict-scans-all-versions [IN] OBSERVATION
The write-write conflict check in `commit()` scans all versions of each key in the transaction's write set (not just the latest), making conflict detection O(versions) per written key.
- Source: entries/2026/05/29/snapshot-isolation-mvcc_database-commit.md

### commit-only-in-batch [IN] OBSERVATION
`OP_COMMIT` is written exclusively by `append_batch()` in `write-ahead-log/wal.py`; single `append()` calls never produce a COMMIT record, so individual operations are their own atomic unit
- Source: entries/2026/05/29/topic-checkpoint-vs-commit-semantics.md

### commit-read-only-skip-conflict-check [IN] OBSERVATION
Read-only transactions bypass write-write conflict detection entirely in `commit()` and always commit successfully, since they cannot produce write-write conflicts.
- Source: entries/2026/05/29/snapshot-isolation-mvcc_database-commit.md

### commit-timestamp-unused-by-visibility [IN] OBSERVATION
Commit timestamps are assigned by `commit()` and stored in `_commit_timestamps`, but `_is_visible` never references them — visibility is determined purely by transaction IDs and the committed/active-at-start sets.
- Source: entries/2026/05/29/snapshot-isolation-mvcc_database-commit.md

### compact-closes-handles-before-keydir-fully-updated [IN] OBSERVATION
In `hash-index-storage/bitcask.py`, cached file readers for old immutable files are closed at the start of the merge-write phase, before all keydir entries have been updated to point to new locations, creating a window where reads would fail even without concurrent access
- Source: entries/2026/05/29/topic-concurrent-merge-safety.md

### compact-deletes-old-sstable-files [IN] OBSERVATION
`compact()` removes old SSTable files from disk after merging, meaning any concurrent reader holding a reference to a deleted SSTable reads stale data (on Unix) or crashes (on Windows)
- Source: entries/2026/05/28/topic-lsm-concurrency-safety.md

### compact-filters-against-live-index [IN] OBSERVATION
During log-structured-hash-table compaction, records in frozen segments are only kept if `_index` still points to that exact `(path, offset)`, correctly excluding keys that were overwritten in the active segment after the frozen segment was closed.
- Source: entries/2026/05/28/log-structured-hash-table-bitcask-compact.md

### compact-newest-wins-by-seq [IN] OBSERVATION
During compaction, the entry with the highest SSTable sequence number wins for each key; sequence numbers are per-SSTable (all entries in one SSTable share the same seq), not per-entry
- Source: entries/2026/05/28/log-structured-merge-tree-lsm-compact.md

### compact-no-concurrency-safety [IN] OBSERVATION
`compact()` mutates `self._sstables` without locking; concurrent `_flush` or `get` calls during compaction can produce incorrect state or lose newly flushed SSTables
- Source: entries/2026/05/28/log-structured-merge-tree-lsm-compact.md

### compact-not-crash-safe [IN] OBSERVATION
Log-structured-hash-table compaction is non-atomic: the sequence of write-new, update-index, delete-old, rename-active has no transaction boundary, and a crash mid-compaction can leave orphaned segments or a misnamed active file.
- Source: entries/2026/05/28/log-structured-hash-table-bitcask-compact.md

### compact-purges-tombstones [IN] OBSERVATION
Tombstones (`b""`) are permanently removed during compaction and never written to the output SSTable; deleted keys disappear entirely after merge
- Source: entries/2026/05/28/log-structured-merge-tree-lsm-compact.md

### compact-sets-base-offset-absolutely [IN] OBSERVATION
`Topic.compact_partition` overwrites `_base_offsets[partition]` with the first surviving message's offset (absolute assignment), discarding whatever the previous base was, because compaction removes messages from arbitrary positions.
- Source: entries/2026/05/29/topic-log-compaction-vs-retention.md

### compact-single-threaded-assumption [IN] OBSERVATION
Log-structured-hash-table compaction has no synchronization protecting the index, file handles, or segment counter during its multi-step mutation sequence; concurrent access during compaction would corrupt state.
- Source: entries/2026/05/28/log-structured-hash-table-bitcask-compact.md

### compact-skips-crc-validation [IN] OBSERVATION
Log-structured-hash-table compaction reads records from frozen segments without verifying CRC checksums, unlike `_scan_segment` which stops at the first CRC mismatch; a corrupted record could be silently copied into the compacted segment.
- Source: entries/2026/05/28/log-structured-hash-table-bitcask-compact.md

### compaction-buffers-all-entries [IN] OBSERVATION
`lsm.py:compact` collects all merged entries into an in-memory list before writing the output SSTable, requiring O(n) memory proportional to total data rather than O(k) proportional to number of input SSTables
- Source: entries/2026/05/28/topic-merge-iterator-vs-dict-merge.md

### compaction-crash-can-resurrect-deleted-keys [IN] OBSERVATION
A crash during LSM compaction after tombstones are stripped from the merged output but before old SSTables are deleted can resurrect previously-deleted keys, because the tombstone that suppressed them no longer exists in any surviving SSTable.
- Source: entries/2026/05/29/topic-crash-recovery-testing.md

### compaction-deletes-before-reader-release [IN] OBSERVATION
`compact()` deletes old SSTable files immediately after replacing `self._sstables`, with no mechanism to defer deletion until active readers holding references to the old list finish their iterations
- Source: entries/2026/05/29/topic-superversion-refcount-implementation.md

### compaction-hazard-within-broken-durability-pipeline [IN] OBSERVATION
Compaction is the highest-risk operation yet operates within a durability pipeline broken at both ends: crash-unsafe compaction can permanently lose data or resurrect deletes, AND the surrounding infrastructure has fsync gaps (data may never reach disk) and unrecoverable integrity checks (corruption detected but not repaired) — the most dangerous operation has the least protection.

### compaction-is-critical-data-lifecycle-hazard [IN] OBSERVATION
Compaction is the critical junction where crash safety and data lifecycle intersect: crash-unsafe compaction can permanently lose data or resurrect deleted keys (no write-temp/fsync/rename, delete-before-rename ordering), and fragmented tombstone semantics mean those corruptions propagate inconsistently through derived systems that require flush and old-value tracking for correctness.

### compaction-is-explicit-not-background [IN] OBSERVATION
Both the LSM and SSTable modules trigger compaction via explicit synchronous method calls (`compact()` / `run_compaction()`), not background threads — removing the write-amplification pressure that motivates least-overlap selection in production systems.
- Source: entries/2026/05/29/topic-leveled-compaction-write-amplification.md

### compaction-is-unsalvageable-as-designed [IN] OBSERVATION
Compaction is unsalvageable as designed: it is the highest-risk operation within a durability pipeline broken at both ends (crash-unsafe writes that are unverifiable by testing), AND it triggers two independent failure modes under the concurrent access that production workloads produce (concurrent readers see inconsistent state, concurrent writers corrupt shared data structures), making it simultaneously the most critical and most dangerous operation.

### compaction-lacks-crash-safety-across-implementations [IN] OBSERVATION
No storage engine implementation uses the write-temp/fsync/rename pattern for file creation, and both Bitcask implementations delete old segments before renaming replacements, creating crash windows where data exists in neither old nor new files.

### compaction-manager-never-deletes-old-files [IN] OBSERVATION
`CompactionManager` removes compacted SSTables from the in-memory `_sstables` list but never deletes their underlying files on disk; cleanup is the caller's responsibility.
- Source: entries/2026/05/29/sstable-and-compaction-sstable-CompactionManager.md

### compaction-manager-never-removes-tombstones [IN] OBSERVATION
The `CompactionManager` in `sstable-and-compaction/sstable.py` always calls `merge_sstables` with the default `remove_tombstones=False`, making it conservatively correct but causing unbounded tombstone accumulation
- Source: entries/2026/05/28/topic-tombstone-gc-safety.md

### compaction-manager-supports-two-strategies [IN] OBSERVATION
`CompactionManager` implements both `'size_tiered'` and `'leveled'` compaction strategies, selected by a string parameter.
- Source: entries/2026/05/28/sstable-and-compaction-test_sstable.md

### compaction-not-atomic [IN] OBSERVATION
`LSMTree.compact()` performs a multi-step file swap (write new SSTable, update in-memory list, delete old files) with no mechanism to make the transition atomic across crashes
- Source: entries/2026/05/28/topic-manifest-based-compaction.md

### compaction-preserves-original-offsets [IN] OBSERVATION
After `compact_partition`, surviving messages retain their original offset values, creating non-contiguous gaps in the offset sequence; offsets are never renumbered.
- Source: entries/2026/05/29/topic-log-compaction-vs-retention.md

### compaction-propagates-corruption [IN] OBSERVATION
`hash-index-storage/bitcask.py:compact` copies records without integrity validation, so silently corrupted data survives compaction into new files
- Source: entries/2026/05/29/topic-hash-index-bitcask-no-crc.md

### compaction-respects-size-limit [IN] OBSERVATION
Bitcask compaction output is itself subject to `max_file_size` rotation, so smaller size limits produce more post-compaction files — feeding back into recovery cost.
- Source: entries/2026/05/29/topic-wal-segment-sizing-tradeoffs.md

### compaction-result-assigned-level-1 [IN] OBSERVATION
`CompactionManager.run_compaction` with leveled strategy assigns merged output to level 1 but does not verify that the new file's key range is disjoint from existing L1 files
- Source: entries/2026/05/29/topic-compaction-output-splitting.md

### compaction-strategy-vs-scheduling-decoupled [IN] OBSERVATION
The `sstable.py` `CompactionManager` separates *which* SSTables to merge (size-tiered vs. leveled strategy) from *when* to merge, but neither the manager nor any caller implements scheduling, rate-limiting, or load-aware deferral logic.
- Source: entries/2026/05/29/topic-bitcask-merge-window-scheduling.md

### compaction-threshold-controls-overlap-window [IN] OBSERVATION
The `compaction_threshold` parameter (`lsm.py:204`) directly controls how many overlapping SSTables can accumulate before compaction, setting the worst-case missing-key probe count to `threshold - 1`
- Source: entries/2026/05/29/topic-size-tiered-vs-leveled-read-amplification.md

### compaction-triggered-by-sstable-count [IN] OBSERVATION
LSM compaction runs automatically when `len(self._sstables) >= self._compaction_threshold` (default 4), triggered at the end of `_flush` after the new SSTable is registered
- Source: entries/2026/05/29/log-structured-merge-tree-lsm-_flush.md

### compare-returns-four-states [IN] OBSERVATION
`VectorClock.compare` returns exactly one of `BEFORE`, `AFTER`, `EQUAL`, or `CONCURRENT`, implementing the partial order defined by component-wise comparison.
- Source: entries/2026/05/29/vector-clocks-vector_clock.md

### concurrency-check-is-pre-mutation [IN] OBSERVATION
The `expected_version` optimistic concurrency check in `append_batch` runs before any state mutation, so a `ConcurrencyConflict` exception is safe and leaves the store completely unchanged.
- Source: entries/2026/05/29/event-sourcing-store-event_store-append_batch.md

### concurrency-unsafe-on-both-read-and-write-paths [IN] OBSERVATION
Concurrency safety is absent from both mutation and query paths: core components (B-tree, garbage collector, consistent hash ring) silently assume single-threaded write access with no locks or assertions, and range scans lack snapshot isolation, cycle guards, or concurrent modification protection, meaning concurrent workloads can corrupt state through both writes and reads independently.

### concurrent-access-during-compaction-is-doubly-unsafe [IN] OBSERVATION
Compaction under concurrent access is unsafe in two independent failure modes: concurrent readers and writers have no synchronization (no locks, latches, or snapshots on either path), AND compaction itself has no crash-safe file operations — concurrent access can corrupt live state silently, and a crash during this unsynchronized compaction produces irrecoverable data loss.

### conflict-requires-different-origins [IN] OBSERVATION
A conflict in `apply_remote_change` is only detected when `local_origin != remote_node`; two writes from the same origin at different timestamps are treated as sequential updates and resolved by tuple comparison without generating a `ConflictRecord`.
- Source: entries/2026/05/29/multi-leader-replication-multi_leader-apply_remote_change.md

### conflict-resolution-architecture-is-split [IN] OBSERVATION
Conflict resolution is implemented in two disconnected modules: CRDTs encode resolution in their merge semantics (self-resolving), while the strategy enum covers only LWW and custom-merge — leaving no unified API and no bridge between the two approaches.

### consensus-and-membership-use-incompatible-convergence-models [IN] OBSERVATION
Gossip-based failure detection and Raft consensus interact in a way that can compound partition hazards: gossip's timeout-driven liveness set determines cluster membership, while Raft's partition behavior means an isolated leader silently accepts uncommittable writes and its inflated term forces re-election upon rejoining — if gossip's failure detection misclassifies a partitioned-but-live leader, it may trigger membership changes that interact with Raft's already-disruptive partition recovery.

### consensus-architectures-are-inverse-optimizations [IN] OBSERVATION
Raft and Total Order Broadcast represent contrasting approaches to Paxos optimization: Raft centralizes proposal authority in a single elected leader (one election per term, then Phase-2-only replication), eliminating dueling proposals within a term, while TOB allows any node to propose for any slot using full two-phase Paxos each time, making it susceptible to competing proposals. These represent different points in the leader-based versus leaderless tradeoff space.

### consensus-correctness-doubly-unverified [IN] OBSERVATION
Consensus protocol correctness is doubly unverified: both Raft and TOB represent untested inverse optimizations of Multi-Paxos whose safety properties diverge specifically under asynchrony and crash failures, AND the testing methodology covers neither crash nor asynchronous failure modes — the exact conditions under which the two optimization strategies would reveal different safety profiles.

### consensus-safety-unverified-across-optimization-variants [IN] OBSERVATION
Both consensus mechanisms represent inverse optimizations of Multi-Paxos (Raft centralizes proposal authority in a leader, TOB decentralizes it to allow any node to propose), but neither variant's safety has been validated under the asynchronous conditions it is designed for — all protocol tests use deterministic synchronous delivery with no real network I/O, message loss, or reordering.

### consistent-hash-add-node-is-idempotent [IN] OBSERVATION
Adding a node that already exists silently returns an empty transfer map and does not change the ring's node count, vnode positions, or key assignments.
- Source: entries/2026/05/29/consistent-hashing-test_consistent_hashing.md

### consistent-hash-default-150-vnodes [IN] OBSERVATION
`ConsistentHashRing` defaults to 150 virtual nodes per physical node, and exposes `load_imbalance` (max_load / avg_load) to measure distribution quality
- Source: entries/2026/05/29/topic-ddia-chapter-6-partitioning.md

### consistent-hash-get-nodes-first-equals-get-node [IN] OBSERVATION
`get_nodes(key)[0]` always equals `get_node(key)` — the preference list's head is the primary replica.
- Source: entries/2026/05/29/consistent-hashing-test_consistent_hashing.md

### consistent-hash-minimal-redistribution [IN] OBSERVATION
Adding an Nth node to the ring moves approximately 1/N of keys, not a full reshuffle — the defining property that makes consistent hashing useful for dynamic cluster membership.
- Source: entries/2026/05/29/consistent-hashing-test_consistent_hashing.md

### consistent-hash-replication-requires-sufficient-physical-nodes [IN] OBSERVATION
`get_nodes()` raises `ValueError` rather than silently under-replicating when `replication_factor` exceeds the number of physical nodes on the ring.
- Source: entries/2026/05/29/consistent-hashing-test_consistent_hashing.md

### consistent-hash-ring-lookup-is-olog-v [IN] OBSERVATION
Key lookup via `get_node` is O(log V) where V is total virtual nodes, using `bisect` on the sorted position list; node addition is O(V) per vnode due to `list.insert`.
- Source: entries/2026/05/29/consistent-hashing-consistent_hashing-ConsistentHashRing.md

### consistent-hash-ring-no-positive-validation [IN] OBSERVATION
`ConsistentHashRing` does not validate that `num_vnodes`, `weight`, or `replication_factor` are positive; zero or negative values silently produce degenerate ring states (e.g., a node with zero vnodes exists in `_nodes` but owns no ring arc).
- Source: entries/2026/05/29/consistent-hashing-consistent_hashing-ConsistentHashRing.md

### consistent-hash-ring-not-thread-safe [IN] OBSERVATION
`ConsistentHashRing` has no synchronization; concurrent `add_node`/`remove_node` calls corrupt the sorted `_ring_positions` and `_ring_nodes` lists.
- Source: entries/2026/05/29/consistent-hashing-consistent_hashing-ConsistentHashRing.md

### consistent-hash-ring-sorted-invariant [IN] OBSERVATION
`_ring_positions` is maintained in sorted order at all times via `bisect` insertion; `_ring_nodes` is kept parallel to it with `len(_ring_positions) == len(_ring_nodes)` always holding.
- Source: entries/2026/05/29/consistent-hashing-consistent_hashing-ConsistentHashRing.md

### consistent-hash-ring-uses-32-bit-space [IN] OBSERVATION
The ring's hash space is `[0, 2^32)`, as validated by `test_ring_position_valid_range`.
- Source: entries/2026/05/29/consistent-hashing-test_consistent_hashing.md

### consistent-hash-ring-uses-md5 [IN] OBSERVATION
All consistent hash ring positions are computed via MD5 truncated to 32 bits; the hash function is not pluggable
- Source: entries/2026/05/29/topic-ddia-chapter-6-partitioning.md

### consistent-hash-ring-uses-md5-truncated-to-32-bit [IN] OBSERVATION
The ring hashes keys and vnode identifiers using `hashlib.md5` truncated to 32 bits via `& 0xFFFFFFFF`, chosen for distribution quality rather than security.
- Source: entries/2026/05/29/consistent-hashing-consistent_hashing-ConsistentHashRing.md

### consistent-hash-weight-scales-vnodes [IN] OBSERVATION
The `weight` parameter on `add_node` scales virtual node count proportionally, supporting heterogeneous nodes with different capacities
- Source: entries/2026/05/29/topic-ddia-chapter-6-partitioning.md

### consumer-group-offset-resume [IN] OBSERVATION
A new consumer joining a group loads committed offsets from the broker (tested at `test_partitioned_log.py:152`), enabling offset-based resume but also duplicate redelivery on crash
- Source: entries/2026/05/29/topic-kafka-consumer-offset-semantics.md

### consumer-independent-positions [IN] OBSERVATION
Each `DerivedSystem` tracks its own LSN position independently via `_position`, allowing consumers to fall behind or catch up at different rates without blocking each other
- Source: entries/2026/05/29/unbundled-database-unbundled_database.md

### context-parameter-establishes-causality [IN] OBSERVATION
Passing a vector clock as `context` to `VersionedKVStore.put` declares the write descends from that version; omitting context treats the write as concurrent with all existing versions
- Source: entries/2026/05/29/vector-clocks-test_vector_clock.md

### convergence-rate-asymmetry-membership-vs-data [IN] OBSERVATION
Membership and data convergence operate at fundamentally different rates: gossip-based membership changes propagate in O(log N) rounds via epidemic-style random peer selection, but data convergence in ring topology requires O(N) sync rounds because each round advances changes by exactly one hop via store-and-forward requeuing — creating a window proportional to cluster size where the membership view is current but data remains stale.

### correctness-gap-widens-under-failure [IN] OBSERVATION
The gap between expected and actual system correctness widens under failure: distributed protocols require storage-layer guarantees (crash-safe compaction, atomic writes, complete CRC coverage) that no implementation provides, and the resulting divergence accumulates without bound because anti-entropy can detect but not fully reconcile the inconsistencies.

### correctness-unachievable-and-unfalsifiable [IN] OBSERVATION
System correctness is both unachievable and unfalsifiable: the gap between specification and implementation widens under every failure mode (distributed protocols require unmet storage guarantees, replica divergence accumulates without bound, storage degrades monotonically), and the testing methodology cannot detect these gaps (crash paths untested, protocols validated only under synchronous simulation) — the system cannot be correct and cannot discover that it is not.

### corruption-is-terminal-across-all-readers [IN] OBSERVATION
Corruption is a terminal condition across every binary record reader in the codebase: each stops at the first CRC failure with no resync capability, and the WAL specifically halts all replay rather than skipping the corrupt record, meaning any single corrupt byte truncates the recoverable history.

### corruption-propagates-through-all-data-pipelines [IN] OBSERVATION
Corruption propagates silently through every data transformation pipeline: both Bitcask compaction implementations copy records without CRC validation, and hint file generation reads source records without integrity checks — every transformation step amplifies corruption rather than filtering it.

### corruption-terminates-scan [IN] OBSERVATION
In `_scan_segment`, a CRC mismatch or short read stops scanning the entire segment; records after the corrupt point are silently lost from the index even if they are individually valid.
- Source: entries/2026/05/28/log-structured-hash-table-bitcask-_scan_segment.md

### count-is-barrier [IN] OBSERVATION
`Count` stage materializes its entire input into a `defaultdict` before yielding any output, making it a pipeline barrier that forces all upstream stages to complete first
- Source: entries/2026/05/29/batch-word-count-pipeline.md

### counting-bloom-4bit-safe-at-capacity [IN] OBSERVATION
At designed capacity (n ≤ expected_items), the expected number of saturated 4-bit counters is effectively zero because the per-position load follows Poisson(ln 2 ≈ 0.693), making P(count ≥ 15) ≈ 10⁻¹⁵
- Source: entries/2026/05/29/topic-counter-saturation-probability.md

### counting-bloom-filter-reference-counted [IN] OBSERVATION
`CountingBloomFilter` uses reference-counting semantics: an item added N times requires N `remove()` calls before it tests as absent via `__contains__`
- Source: entries/2026/05/29/bloom-filter-test_bloom_filter.md

### counting-bloom-overload-threshold [IN] OBSERVATION
At 10× overload (10× expected_items inserted), roughly 0.5% of CountingBloomFilter counters saturate; at 20× overload, approximately half saturate, effectively degrading the filter to a non-counting Bloom filter
- Source: entries/2026/05/29/topic-counter-saturation-probability.md

### counting-bloom-remove-needs-drop-tracking [IN] OBSERVATION
Using `CountingBloomFilter.remove()` during compaction would require `merge_sstables` to report which keys were discarded, which the current implementation does not do — duplicates and tombstones are silently skipped in the merge loop
- Source: entries/2026/05/28/topic-counting-bloom-compaction-interaction.md

### counting-bloom-supports-deletion [IN] OBSERVATION
`CountingBloomFilter` uses saturating 4-bit counters so entries can be removed via `remove()`, which standard `BloomFilter` cannot do — relevant for maintaining filters over mutable data structures
- Source: entries/2026/05/28/topic-bloom-filter-integration.md

### counting-bloom-trades-correctness-for-unneeded-capability [IN] OBSERVATION
The counting Bloom filter pays 8x memory overhead and introduces false negatives through its removal operation for a capability (element deletion) that the primary use case — immutable SSTables — never needs, since SSTables are write-once and discarded whole during compaction.

### counting-bloom-useful-for-memtable [IN] OBSERVATION
`CountingBloomFilter.remove()` is architecturally suited for maintaining filters over mutable in-memory structures (memtables) where keys can be overwritten without rescanning, not for SSTable compaction where the entire file is rewritten
- Source: entries/2026/05/28/topic-counting-bloom-compaction-interaction.md

### counting-bloom-uses-byte-per-counter [IN] OBSERVATION
CountingBloomFilter allocates one full byte per counter position (`bytearray(self._m)` at line 112) despite `counter_bits` defaulting to 4, using 8x the space of a standard BloomFilter's bit array rather than the expected 4x.
- Source: entries/2026/05/29/topic-cuckoo-filters.md

### crash-failure-paths-systematically-untested [IN] OBSERVATION
Crash and failure recovery paths are systematically excluded from the test suite: the WAL has no tests for truncated records or CRC mismatches, LSM crash testing covers only WAL replay and ignores compaction crashes entirely, and SSI write-skew tests exist only in standalone tester files outside the default pytest runner — the most critical correctness scenarios have the least test coverage.

### crash-recovery-both-broken-and-unverified [IN] OBSERVATION
Crash recovery is simultaneously broken and unverified: no storage engine has a safe crash recovery path (non-atomic compaction, batch-blind replay, metadata-excluding CRC), and crash/failure paths are systematically excluded from the test suite — recovery bugs will persist indefinitely because neither the broken mechanisms nor their absence is tested.

### crc-detects-not-prevents-data-loss [IN] OBSERVATION
CRC32 checksums in the WAL and Bitcask detect partial writes after a crash but cannot recover the lost data — they convert silent corruption into detected data loss, not into a recoverable state
- Source: entries/2026/05/29/topic-crash-safety-of-truncate.md

### crc-mask-is-python2-artifact [IN] OBSERVATION
The `& 0xFFFFFFFF` mask applied after `zlib.crc32` in all modules is a Python 2 compatibility artifact; Python 3's `zlib.crc32` already returns an unsigned 32-bit integer, making the mask harmless but unnecessary.
- Source: entries/2026/05/29/topic-crc32-vs-crc32c.md

### crc-no-cross-file-atomicity [IN] OBSERVATION
The CRC32 integrity checks in `log-structured-hash-table/bitcask.py` detect corrupt records within a segment but provide no protection against the cross-file atomicity problem during compaction.
- Source: entries/2026/05/28/topic-bitcask-compaction-atomicity.md

### crc-polynomial-change-breaks-wire-format [IN] OBSERVATION
Switching from ISO 3309 (`zlib.crc32`) to Castagnoli (CRC-32C) is a backward-incompatible wire format change; existing data files would fail CRC verification without a format version mechanism in the file header.
- Source: entries/2026/05/29/topic-crc32-vs-crc32c.md

### crc32-covers-records-not-files [IN] OBSERVATION
CRC32 is computed per individual record or page (typically hundreds of bytes to a few KB), not per file; file-level corruption is detected only indirectly by failing to decode a record, not by any file-wide checksum
- Source: entries/2026/05/29/topic-crc32-vs-xxhash-choice.md

### crdt-eq-compares-semantic-state [IN] OBSERVATION
All four CRDT types implement `__eq__` to compare semantic state (not object identity), enabling `CRDTReplicaGroup.all_converged()` and the semilattice property tests to verify convergence via equality checks.
- Source: entries/2026/05/29/conflict-free-replicated-data-types-test_crdts.md

### crdt-merge-is-idempotent [IN] OBSERVATION
All four CRDT `merge` methods (`GCounter`, `PNCounter`, `LWWRegister`, `ORSet`) are idempotent: merging the same state twice produces the same result as merging once
- Source: entries/2026/05/29/conflict-free-replicated-data-types-crdts.md

### crdt-merge-is-idempotent-and-convergence-tested [IN] OBSERVATION
All four CRDT types demonstrate idempotent merge (re-merging produces no change), semantic equality comparison, and monotonic ORSet tombstones. The sync_all test confirms convergence after two rounds, consistent with merge being commutative and associative, though these properties are exercised by tests rather than proven by the antecedents alone.

### crdt-merge-returns-self [IN] OBSERVATION
`merge()` mutates and returns `self` (enabling chaining like `deepcopy(a).merge(b)`) rather than returning a new instance, which means callers must deepcopy before merging if they need to preserve the original state.
- Source: entries/2026/05/29/conflict-free-replicated-data-types-test_crdts.md

### crdt-mutation-api-is-non-defensive [IN] OBSERVATION
CRDT mutation APIs lack defensive programming patterns at the boundaries: merge mutates the receiver in place and returns self (requiring callers to defensively copy before merging to prevent aliasing), and remove operations on absent elements silently return False rather than raising errors, making accidental no-ops invisible.

### crdts-are-self-resolving [IN] OBSERVATION
Each CRDT type in `crdts.py` (GCounter, PNCounter, LWWRegister, ORSet) encodes conflict resolution in its own `merge()` method, requiring no external strategy enum or callback
- Source: entries/2026/05/29/topic-conflict-resolution-strategies.md

### crdts-are-state-based-cvrdts [IN] OBSERVATION
All four CRDT types (GCounter, PNCounter, LWWRegister, ORSet) use state-based replication via a `merge()` method that implements a join-semilattice; no operation log or causal delivery is used
- Source: entries/2026/05/29/topic-delta-state-crdts.md

### critical-wal-operations-always-force-fsync [IN] OBSERVATION
Checkpoints, batch commits, and segment rotations all bypass the configured sync mode by calling fsync unconditionally, creating a two-tier durability model where structural WAL operations are always durable even when individual record writes are not.

### data-distribution-and-processing-hit-independent-scalability-ceilings [IN] OBSERVATION
Data distribution and processing both hit fundamental scalability ceilings: neither partitioning strategy provides both reliable routing and ordered access (hash destroys order, range depends on unverified parallel-array invariants), while query and aggregation operations hide unbounded memory barriers behind uniform tuple interfaces, meaning neither the data layout nor the query path scales gracefully under load.

### data-scan-skips-value-bytes [IN] OBSERVATION
`_scan_data_file` in the hash-index Bitcask reads each record's header and key but seeks past value bytes; it is still O(data_size) because headers must be parsed sequentially to locate record boundaries
- Source: entries/2026/05/29/topic-bitcask-startup-cost.md

### ddia-config-via-constructor-params [IN] OBSERVATION
All 37 modules configure behavior entirely through constructor parameters (e.g., `max_file_size`, `sync_mode`, `page_size`) — there are no config files, environment variables, or settings modules.
- Source: entries/2026/05/29/repo-overview.md

### ddia-modules-self-contained [IN] OBSERVATION
Each top-level directory is a self-contained module with no cross-module imports; implementations are intentionally standalone so each concept can be understood in isolation.
- Source: entries/2026/05/28/scan-ddia-implementations.md

### ddia-pure-python-stdlib [IN] OBSERVATION
All implementations are Python; the storage engine modules use only stdlib except for `sortedcontainers` in the LSM tree — no frameworks or production infrastructure.
- Source: entries/2026/05/28/scan-ddia-implementations.md

### ddia-tester-files-stdout-validation [IN] OBSERVATION
The `tester_test_*.py` files are designed as standalone scripts that print `"test_name PASSED"` / `"test_name FAILED"` to stdout and include `if __name__ == "__main__"` blocks, enabling automated verification without pytest.
- Source: entries/2026/05/29/repo-overview.md

### ddia-three-file-pattern [IN] OBSERVATION
Most modules follow a consistent three-file pattern: one implementation file, one test file, and a `tester_test_*.py` file (likely a meta-test or test harness validator).
- Source: entries/2026/05/28/scan-ddia-implementations.md

### decode-with-id-no-format-validation [IN] OBSERVATION
`decode_with_id` does not validate the message format before parsing; any 4+ byte input will be interpreted as schema-ID-prefixed data, producing a confusing KeyError on invalid input rather than a clear format error
- Source: entries/2026/05/29/topic-confluent-schema-registry-protocol.md

### deepcopy-prevents-snapshot-corruption [IN] OBSERVATION
`save_snapshot()` uses `copy.deepcopy` on the projection state dict to create an independent copy, ensuring subsequent event processing during `catch_up()` does not mutate the saved snapshot
- Source: entries/2026/05/29/topic-event-sourcing-snapshots.md

### degradation-is-irreversible [IN] OBSERVATION
System degradation is irreversible: the system degrades monotonically at every abstraction level with no self-healing (leaked pages, growing tree height, accumulating divergence), and no subsystem at any architectural tier has a viable recovery strategy to arrest the erosion — recovery infrastructure is either vestigial, paradoxically over-engineered, or fundamentally missing.

### delete-before-rename-ordering [IN] OBSERVATION
Both Bitcask implementations delete old segment files (`os.remove`) before renaming the compacted replacement, creating a crash window where neither old nor properly-named new data is on disk
- Source: entries/2026/05/28/topic-bitcask-crash-recovery.md

### delete-semantics-fragmented-from-storage-to-derived-systems [IN] OBSERVATION
Delete propagation is fragmented end-to-end: tombstone representations differ at every storage layer (ambiguous empty-bytes sentinel, premature compaction purging, replication-dependent lifetime), and derived systems require explicit flush plus old-value CDC events that inconsistent tombstones cannot reliably provide.

### deletion-visibility-is-symmetric [IN] OBSERVATION
The same three-condition visibility rule (committed, not in active_at_start, lower tx_id) is applied independently to both the creating and deleting transaction of a version — `_is_visible` runs the same logic in two phases.
- Source: entries/2026/05/29/snapshot-isolation-mvcc_database-_is_visible.md

### derived-system-consistency-requires-flush-and-old-values [IN] OBSERVATION
Derived systems (secondary indexes, materialized views) require two independent conditions for consistency: explicit flush to make mutations visible, and CDC old-value capture to remove stale index entries — either condition failing silently produces stale reads.

### derived-systems-are-position-tracked [IN] OBSERVATION
Every `DerivedSystem` in the unbundled database tracks a `position` cursor indicating how far through the CDC log it has consumed, enabling independent catch-up and ensuring consumers can resume from where they left off
- Source: entries/2026/05/29/topic-ddia-ch12-unbundling.md

### derived-systems-depend-on-unreliable-event-infrastructure [IN] OBSERVATION
The derived-system pattern (secondary indexes, materialized views, projections) depends on event infrastructure that is unreliable at its foundation: event sourcing has no durable recovery mechanism and conflates two ID spaces, while derived systems require explicit flush calls and old-value CDC events for consistency — the consumer-side correctness requirements depend on producer-side guarantees that do not hold.

### derived-systems-independently-position-tracked [IN] OBSERVATION
Each `DerivedSystem` tracks its own LSN `position` independently, allowing consumers to fall behind or catch up at different rates without coordinator state or blocking
- Source: entries/2026/05/29/topic-ddia-unbundling-concept.md

### derived-systems-rebuildable-from-cdc [IN] OBSERVATION
Every `DerivedSystem` implements `rebuild(events)` that clears state and replays from scratch, guaranteeing eventual convergence with the CDC event log regardless of prior state
- Source: entries/2026/05/29/unbundled-database-unbundled_database.md

### deterministic-serialization-enables-independent-verification [IN] OBSERVATION
`json.dumps(sort_keys=True)` ensures all honest PBFT nodes produce identical SHA-256 digests for identical requests without inter-node coordination, transforming trust verification into a deterministic computation
- Source: entries/2026/05/29/topic-byzantine-fault-tolerance.md

### digest-binds-view-sequence-to-request [IN] OBSERVATION
The `accepted_preprepare` dict enforces a one-to-one mapping from `(view, sequence)` to digest, preventing a Byzantine primary from assigning two different requests to the same protocol slot
- Source: entries/2026/05/29/topic-byzantine-fault-tolerance.md

### digest-recomputed-on-preprepare [IN] OBSERVATION
Backup replicas independently recompute the SHA-256 digest from the request payload in every PRE_PREPARE message; they never trust the primary's claimed digest value
- Source: entries/2026/05/29/topic-byzantine-fault-tolerance.md

### directory-scan-recovery-unsafe-during-compaction [IN] OBSERVATION
`_load_existing_sstables` using `os.listdir` cannot distinguish pre-compaction from post-compaction file sets after a crash — if the crash occurs after creating the merged SSTable but before deleting the inputs, recovery sees duplicates; the inverse loses data.
- Source: entries/2026/05/29/topic-manifest-and-crash-recovery.md

### distributed-cluster-lacks-reliable-infrastructure [IN] OBSERVATION
The distributed cluster has no reliable foundational infrastructure: membership detection is unreliable in both accuracy (no adaptive thresholds) and propagation (asymmetric convergence rates between membership and data), and ordering infrastructure is broken at every layer (vestigial WAL sequence numbers, conflated event sourcing ID spaces, volatile CDC consumer positions), meaning neither who is in the cluster nor what order things happened can be answered reliably.

### distributed-correctness-doubly-unachievable-under-partition [IN] OBSERVATION
Distributed correctness is doubly unachievable under network partitions: protocols require storage-layer guarantees (crash-safe compaction, CRC-protected metadata) that aren't met, AND partitions amplify the resulting gaps through disrupted gossip-based failure detection and stale leader writes — the prerequisites for correctness are absent even before partitions introduce additional failure modes.

### distributed-correctness-undermined-at-both-layers [IN] OBSERVATION
Distributed system correctness is undermined at both the storage and protocol layers: storage engines silently assume single-threaded access (no locks, no assertions, no documentation), while the quorum protocol weakens its own semantic guarantees (counting hint storage as successful writes, allowing sub-quorum configurations without error).

### distributed-divergence-accumulates-without-bound [IN] OBSERVATION
Replica divergence accumulates without bound: write operations have compounding correctness gaps (sloppy quorums count hints, sub-quorum configs accepted, conflict resolution split across modules), and the repair mechanism (Merkle-based anti-entropy) cannot fully reconcile because tombstone semantics differ at every layer.

### distributed-protocols-rest-on-unverifiable-assumptions [IN] OBSERVATION
Distributed protocols rest on doubly invalid foundations: end-to-end correctness requires storage-layer guarantees (crash-safe compaction, CRC-protected metadata) that no storage engine provides, and protocol safety claims are unfalsifiable under the current testing methodology (synchronous simulation, no crash path tests) — the protocols assume both correct storage and correct testing, and have neither.

### distributed-protocols-simulate-synchronous-delivery [IN] OBSERVATION
All distributed protocol implementations use synchronous message delivery: PBFT runs the full three-phase protocol in a single deterministic call, bully elections resolve cascading responses within one tick, and Lamport clocks deliver messages in the same call stack — none model the network asynchrony that is the core difficulty of distributed systems.

### distributed-tombstone-gc-requires-downtime-bound [IN] OBSERVATION
Any tombstone garbage collection strategy must define a maximum tolerated node downtime; tombstones removed before a down node receives them cause data resurrection via read repair or anti-entropy.
- Source: entries/2026/05/29/topic-tombstone-gc-and-repair-window.md

### distributed-tombstone-removal-needs-replication-convergence [IN] OBSERVATION
In the multi-leader replication module, tombstones cannot be safely removed until all replicas have received the delete, adding a replication-convergence constraint beyond the local compaction-coverage constraint
- Source: entries/2026/05/28/topic-tombstone-gc-safety.md

### distributed-writes-have-compounding-correctness-gaps [IN] OBSERVATION
Distributed write correctness has compounding weaknesses: quorum semantics are weakened by sloppy quorum and permissive configuration, and the conflict resolution that should catch remaining inconsistencies is split between two disconnected mechanisms (self-resolving CRDTs and an incomplete strategy enum).

### do-sync-force-bypasses-sync-mode [IN] OBSERVATION
`_do_sync(force=True)` always calls fsync regardless of the configured `sync_mode`, ensuring durability-critical operations like segment rotation and WAL close are never silently skipped
- Source: entries/2026/05/29/topic-sync-mode-none-safety.md

### doc-partitioned-query-touches-all [IN] OBSERVATION
`DocumentPartitionedDB.query_by_field` always iterates every partition regardless of result count, touching exactly `num_partitions` partitions (scatter/gather with no short-circuit).
- Source: entries/2026/05/29/secondary-index-partitioning-secondary_index_partitioning.md

### doc-partitioned-write-touches-one [IN] OBSERVATION
`DocumentPartitionedDB.put()` always touches exactly 1 partition regardless of how many fields are indexed, because the secondary index is co-located with the document on its home partition.
- Source: entries/2026/05/29/secondary-index-partitioning-test_secondary_index_partitioning.md

### durability-bugs-invisible-to-testing [IN] OBSERVATION
Durability bugs are permanently invisible: the write-to-verify durability pipeline is broken at both ends (fsync policy gaps prevent data from reaching stable storage, and incomplete integrity checks cannot verify it arrived), while crash/failure recovery paths are systematically excluded from testing — the system cannot detect its own durability failures through any available mechanism.

### durability-pipeline-broken-at-both-ends [IN] OBSERVATION
The write-to-verify durability pipeline is broken at both ends: fsync policy inconsistencies across critical paths mean data may never durably reach disk (B-tree skips fsync for structural metadata while paying double for data), and incomplete CRC coverage means corrupted data that does reach disk passes integrity checks undetected (payload-only CRC leaves routing metadata unprotected).

### dynamo-cluster-hints-after-quorum [IN] OBSERVATION
`DynamoCluster.put` only stores hints after the write quorum is already met on real nodes, unlike `HintedHandoffStore` which counts hints toward the quorum — making DynamoCluster a strict-quorum-with-opportunistic-hints design, not a true sloppy quorum
- Source: entries/2026/05/29/topic-sloppy-quorum-tradeoffs.md

### dynamo-conflict-detection-same-version-different-values [IN] OBSERVATION
When multiple replicas hold different values at the same max version number, `ReadResult.is_conflict` is `True` and `value` is a list of all conflicting values rather than a single resolved value
- Source: entries/2026/05/29/leaderless-replication-test_dynamo_tester.md

### dynamo-failed-write-rolls-back-version [IN] OBSERVATION
When `put()` fails to meet write quorum, it decrements the version counter to prevent version gaps, then raises `QuorumNotMet`
- Source: entries/2026/05/29/leaderless-replication-dynamo.md

### dynamo-hints-single-homed [IN] OBSERVATION
Hinted handoffs for all unavailable nodes are stored on `available_nodes[0]` only, creating a single point of responsibility for hint delivery and a potential hotspot
- Source: entries/2026/05/29/leaderless-replication-dynamo.md

### dynamo-no-delete-support [IN] OBSERVATION
The leaderless replication implementation (`dynamo.py`) has no delete or tombstone mechanism; adding deletes without tombstones would cause resurrection via read repair or anti-entropy
- Source: entries/2026/05/29/topic-sstable-tombstone-gc-safety.md

### dynamo-per-key-versioning [IN] OBSERVATION
Version counters in `DynamoCluster._version_counters` are scoped per key, not global across the cluster; independent keys maintain independent version sequences
- Source: entries/2026/05/29/leaderless-replication-test_dynamo_tester.md

### dynamo-read-repair-is-all-node [IN] OBSERVATION
`DynamoCluster.get()` repairs all available nodes after every read, not just the quorum participants, trading higher per-read write fan-out for faster convergence compared to `ReadRepairStore`.
- Source: entries/2026/05/29/topic-anti-entropy-vs-read-repair.md

### dynamo-read-repair-is-eager [IN] OBSERVATION
`DynamoCluster.get()` repairs all stale replicas reachable during the read, not just enough to satisfy the quorum — every replica with a version below the max is updated
- Source: entries/2026/05/29/leaderless-replication-dynamo.md

### dynamo-sloppy-quorum-opt-in [IN] OBSERVATION
Hinted handoff only occurs when `DynamoCluster` is constructed with `sloppy_quorum=True`; without it, `deliver_hints()` returns 0 and offline nodes receive no data until anti-entropy runs
- Source: entries/2026/05/29/leaderless-replication-test_dynamo_tester.md

### dynamo-version-counter-is-global [IN] OBSERVATION
Version assignment uses a single coordinator-level counter per key (`_version_counters`), not per-replica counters, which prevents version conflicts but requires a centralized coordinator
- Source: entries/2026/05/29/leaderless-replication-dynamo.md

### dynamo-write-fans-to-all [IN] OBSERVATION
`DynamoCluster.put()` sends writes to every available node, not just W of them; the quorum check gates on acknowledgment count, not send count, maximizing replica consistency
- Source: entries/2026/05/29/leaderless-replication-dynamo.md

### empty-predicate-match-still-locks [IN] OBSERVATION
A predicate that matches zero keys still appends a predicate lock to `tx._predicate_locks`, enabling phantom detection if a concurrent transaction later inserts a key that would match
- Source: entries/2026/05/29/write-skew-detection-ssi_database-read_predicate.md

### encode-with-id-no-magic-byte [IN] OBSERVATION
`SchemaRegistry.encode_with_id` uses a raw 4-byte big-endian schema ID prefix with no magic byte, unlike Confluent's 5-byte header (0x00 + 4-byte ID), so there is no format versioning for future wire format changes
- Source: entries/2026/05/29/topic-confluent-schema-registry-protocol.md

### end-to-end-correctness-requires-unmet-storage-guarantees [IN] OBSERVATION
End-to-end distributed correctness is unachievable: protocol-layer weaknesses (sloppy quorums, single-threaded assumptions) depend on storage-layer guarantees (atomic recovery, batch integrity, metadata checksums) that no implementation provides.

### entry-count-header-never-verified [IN] OBSERVATION
The entry count stored in the SSTable header by `SSTableWriter.finish()` is trusted by `SSTableReader` but never validated against the actual number of entries read from the data section
- Source: entries/2026/05/29/topic-sstable-magic-number-vs-crc.md

### equivocating-mode-expands-message-count [IN] OBSERVATION
In `EQUIVOCATING` mode, each broadcast message is replaced by N-1 targeted messages with per-peer distinct digests computed via `compute_digest`, meaning the output list can be larger than the input
- Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-_apply_byzantine.md

### es-event-ids-are-stream-scoped [IN] OBSERVATION
Event IDs are sequential integers per-stream starting at 1, not globally unique identifiers; `global_position` is the separate cross-stream counter.
- Source: entries/2026/05/29/event-sourcing-store-test_event_store.md

### es-global-position-tracks-all-streams [IN] OBSERVATION
`global_position` increments monotonically across all streams and equals the total number of events ever appended to the store.
- Source: entries/2026/05/29/event-sourcing-store-test_event_store.md

### es-live-projection-auto-updates [IN] OBSERVATION
`LiveProjection` automatically applies new events on each `append()` after the initial `catch_up()` call — no subsequent explicit `catch_up()` required.
- Source: entries/2026/05/29/event-sourcing-store-test_event_store.md

### es-optimistic-concurrency-via-expected-version [IN] OBSERVATION
`append(..., expected_version=V)` succeeds only if the current stream version equals V; a stale version raises `ConcurrencyConflict`.
- Source: entries/2026/05/29/event-sourcing-store-test_event_store.md

### es-persistence-uses-jsonl [IN] OBSERVATION
EventStore disk persistence uses JSONL (newline-delimited JSON) format, loaded in full on construction to rebuild the in-memory state.
- Source: entries/2026/05/29/event-sourcing-store-test_event_store.md

### es-projection-catch-up-is-incremental [IN] OBSERVATION
`Projection.catch_up()` only processes events after its current `position`, returning the count of newly processed events; a second call with no new events processes zero.
- Source: entries/2026/05/29/event-sourcing-store-test_event_store.md

### es-snapshot-saves-state-and-position [IN] OBSERVATION
`Projection.save_snapshot()` persists both the projection state and its current position; `load_snapshot()` restores both so that `catch_up()` resumes from the snapshot point.
- Source: entries/2026/05/29/event-sourcing-store-test_event_store.md

### es-snapshots-are-in-memory-dicts [IN] OBSERVATION
Event sourcing snapshots on `store._snapshots` are plain Python dicts monkey-patched onto the store instance with no serialization; `getattr(self._store, '_snapshots', {})` defensively handles their absence
- Source: entries/2026/05/29/topic-snapshot-persistence.md

### es-snapshots-incomplete-for-production-use [IN] OBSERVATION
Event sourcing snapshots are triply limited: stored as ephemeral in-memory dicts (lost on restart), excluding subscriber callbacks (requiring re-registration), and ignored by `reconstruct_state` (which always replays from the beginning).

### es-temporal-query-via-up-to [IN] OBSERVATION
`reconstruct_state(handlers, events, up_to=N)` replays exactly the first N events to rebuild past state at that point in time.
- Source: entries/2026/05/29/event-sourcing-store-test_event_store.md

### es-verify-is-script-not-pytest [IN] OBSERVATION
`test_verify.py` runs as a standalone Python script with inline assertions and a final print statement, not through a test framework like pytest
- Source: entries/2026/05/29/event-sourcing-store-test_verify.md

### event-graph-acyclicity-by-construction [IN] OBSERVATION
`_reaches` assumes the event graph is a DAG with no cycle detection; this is safe because `Node` methods only ever link to previously-created events via `_parent` and `_cause`.
- Source: entries/2026/05/29/lamport-clocks-lamport-_reaches.md

### event-graph-correctness-via-dual-invariant [IN] OBSERVATION
The event causality graph maintains correctness through two complementary invariants: acyclicity by construction (preventing infinite traversal in reachability checks) and identity-based comparison (preventing conflation of structurally-equal but distinct events).

### event-ids-are-1-based-sequential [IN] OBSERVATION
`event_id` is a global sequence number assigned as `len(self._events) + 1`, starting at 1 and incrementing by 1 for each appended event with no gaps within a process lifetime
- Source: entries/2026/05/29/event-sourcing-store-event_store.md

### event-infrastructure-unreliable-in-content-and-ordering [IN] OBSERVATION
Derived systems depend on event infrastructure that is independently unreliable in both content and ordering: events may be lost, duplicated, or mis-addressed due to unreliable persistence and addressing, AND their ordering is broken at every layer — WAL sequence numbers are vestigial, event IDs conflate two spaces, and CDC consumer positions are volatile.

### event-sourcing-and-cdc-converge-on-projection-pattern [IN] OBSERVATION
Both `Projection.catch_up` and `DerivedSystem.process_event` track a position cursor, pull/receive ordered events, and apply type-dispatched handlers — the same structural pattern solving "derived data" from different starting points
- Source: entries/2026/05/29/topic-ddia-unbundling-concept.md

### event-sourcing-conflates-two-id-spaces [IN] OBSERVATION
Event sourcing read operations conflate two ID spaces: projections assume contiguous stream-scoped IDs for catch-up arithmetic, while state reconstruction filters on global event_id, and sequence numbers survive compaction without renumbering — creating potential mismatches when events are removed or compacted.

### event-sourcing-lacks-any-durable-recovery-path [IN] OBSERVATION
Event sourcing has no durable recovery mechanism: appends update in-memory state before persisting to disk with no rollback on failure, snapshots are ephemeral in-memory dicts lost on restart, and state reconstruction must always replay from the beginning with no incremental checkpoint support.

### event-sourcing-unreliable-in-persistence-and-addressing [IN] OBSERVATION
Event sourcing is fundamentally unreliable at two independent levels: it has no durable recovery mechanism (in-memory state updated before disk, no crash-consistent append, ephemeral snapshots), and its core addressing abstraction conflates stream-scoped and global ID spaces across read paths, meaning even successfully persisted events may be incorrectly reconstructed.

### event-store-append-is-not-crash-consistent [IN] OBSERVATION
Event store appends update in-memory state before persisting to disk with no rollback on failure, batch appends can leave partial writes, and the per-call file open/close pattern adds overhead without providing atomicity guarantees.

### event-store-load-ignores-partial-batches [IN] OBSERVATION
`EventStore._load_from_file` replays every NDJSON line on startup with no mechanism to detect or discard incomplete batches left by a mid-batch crash, unlike the WAL which uses `OP_COMMIT` records as transaction boundaries.
- Source: entries/2026/05/29/topic-batch-atomicity-gap.md

### event-store-memory-ahead-of-disk [IN] OBSERVATION
Event store appends events to the in-memory `_events` list before calling `_persist_event`, with no rollback on write failure — a crash or I/O error leaves the in-memory state ahead of disk with no mechanism to reconcile.
- Source: entries/2026/05/29/topic-crash-recovery-invariants.md

### event-store-ndjson-could-resync-but-doesnt [IN] OBSERVATION
The event store uses newline-delimited JSON where per-line resync is architecturally possible, but `_load_from_file` halts on the first `json.JSONDecodeError` rather than skipping the bad line
- Source: entries/2026/05/29/topic-length-prefix-framing-resilience.md

### event-store-no-explicit-durability [IN] OBSERVATION
`event-sourcing-store/event_store.py:_persist_event` relies solely on Python's context manager `close()` for implicit `flush()` with no explicit `flush()` or `fsync()`, providing no durability guarantee against either process crash or power loss.
- Source: entries/2026/05/29/topic-bitcask-durability-tradeoffs.md

### event-store-optimistic-concurrency [IN] OBSERVATION
`EventStore.append()` accepts an `expected_version` parameter and rejects appends when the stream's current version doesn't match, implementing optimistic concurrency control for event streams
- Source: entries/2026/05/29/topic-ddia-ch12-unbundling.md

### event-store-optimistic-concurrency-via-expected-version [IN] OBSERVATION
`EventStore.append` takes an `expected_version` parameter for optimistic concurrency on stream appends, rejecting writes when the stream has advanced past the caller's version
- Source: entries/2026/05/29/topic-ddia-unbundling-concept.md

### event-store-persist-no-durability [IN] OBSERVATION
`EventStore._persist_event()` in `event-sourcing-store/event_store.py` opens the file in a `with` block, writes JSON, and relies on implicit `close()` — no explicit `flush()` or `os.fsync()`, making persisted events vulnerable to loss on OS crash
- Source: entries/2026/05/29/topic-os-fsync-durability.md

### event-store-persist-no-flush-no-fsync [IN] OBSERVATION
`EventStore._persist_event` opens the NDJSON file in append mode, writes one JSON line, and closes it without calling `flush()` or `os.fsync()`; written data may remain in OS buffers or Python buffers at crash time.
- Source: entries/2026/05/29/topic-batch-atomicity-gap.md

### event-store-persist-pattern [IN] OBSERVATION
`EventStore` in `event-sourcing-store/event_store.py` is the repo's existing model for optional disk persistence: accept `persist_path` in constructor, call `_load_from_file` on init if the file exists, call `_persist_event` on every mutation
- Source: entries/2026/05/29/topic-snapshot-disk-persistence.md

### event-store-persist-separate-open-close [IN] OBSERVATION
`EventStore._persist_event` performs a separate `open()`/`write()`/`close()` cycle for each individual event, so `append_batch` with N events results in N independent file operations rather than a single buffered write.
- Source: entries/2026/05/29/topic-batch-atomicity-gap.md

### event-store-single-threaded-assumption [IN] OBSERVATION
`EventStore` assumes single-threaded access: `event_id = len(self._events) + 1` is a race condition under concurrency, and no locking protects `_events`, `_streams`, or subscriber notification.
- Source: entries/2026/05/29/event-sourcing-store-event_store-append_batch.md

### event-store-single-writer-assumption [IN] OBSERVATION
The event store's `expected_version` check-then-act pattern in `append` and `append_batch` has no locking, making the optimistic concurrency guard safe only under a single-writer concurrency model — concurrent writers create a TOCTOU race
- Source: entries/2026/05/29/topic-event-sourcing-concurrency-model.md

### expiration-cutoff-is-one-sided-despite-symmetric-matching [IN] OBSERVATION
`_expire_events` uses `watermark - duration` as a one-sided cutoff for buffer cleanup, while `contains()` is symmetric; this is correct because future events can only arrive at or after the watermark
- Source: entries/2026/05/29/topic-sliding-vs-tumbling-windows.md

### external-sort-is-transparent [IN] OBSERVATION
`Sort` with `memory_limit` produces the same output ordering as an in-memory sort — the external merge-sort is a pure implementation optimization with no semantic difference.
- Source: entries/2026/05/29/batch-word-count-test_pipeline.md

### fdatasync-never-used [IN] OBSERVATION
No implementation uses `os.fdatasync()`, missing a safe optimization for append-only WALs where only data (not metadata like mtime) needs to reach disk.
- Source: entries/2026/05/29/topic-fsync-flush-distinction.md

### fdatasync-not-available-on-darwin [IN] OBSERVATION
Python's `os.fdatasync` exists only on Linux, not macOS/Darwin; any switch from `os.fsync` requires a platform fallback via `getattr(os, 'fdatasync', os.fsync)` to work on the current development environment
- Source: entries/2026/05/29/topic-fdatasync-vs-fsync-optimization.md

### fdatasync-safe-for-all-append-paths [IN] OBSERVATION
Every file opened in append mode (`"ab"`) in this codebase writes sequentially with monotonically increasing file size, making `fdatasync` a safe drop-in replacement for `fsync` on those paths (WAL, Bitcask data files, B-tree WAL)
- Source: entries/2026/05/29/topic-fdatasync-vs-fsync-optimization.md

### fenced-server-rejects-strictly-lower-tokens [IN] OBSERVATION
`FencedResourceServer.write` rejects writes where `fencing_token < highest` (strict less-than, not less-than-or-equal), meaning same-token retries succeed — enabling idempotent writes with the same lock acquisition
- Source: entries/2026/05/29/topic-ddia-ch8-process-pauses.md

### fencing-no-exceptions-on-denial [IN] OBSERVATION
All denial conditions in fencing tokens (lock held, stale token, wrong client) return sentinel values (`None`, `False`, or error dicts) rather than raising exceptions, modeling distributed system responses
- Source: entries/2026/05/29/fencing-tokens-fencing_tokens.md

### fencing-rejects-stale-writes [IN] OBSERVATION
`FencedResourceServer.write()` rejects any write with a fencing token strictly less than the highest token previously seen for that resource; equal tokens are accepted
- Source: entries/2026/05/29/fencing-tokens-fencing_tokens.md

### fencing-token-counter-is-global [IN] OBSERVATION
The fencing token counter is global across all locks, not per-lock; tokens issued for different locks are comparable and strictly ordered by acquisition time
- Source: entries/2026/05/29/fencing-tokens-test_fencing_tokens.md

### fencing-token-monotonic [IN] OBSERVATION
`LockService._counter` starts at 1, increments by 1 on every successful `acquire()`, and is never decremented or reset, guaranteeing strict monotonicity of issued tokens
- Source: entries/2026/05/29/fencing-tokens-fencing_tokens.md

### fencing-token-tracking-is-per-resource [IN] OBSERVATION
Token high-water marks on `FencedResourceServer` are scoped per resource identifier, so a lower token can succeed on a different resource than the one that saw a higher token
- Source: entries/2026/05/29/fencing-tokens-test_fencing_tokens.md

### fencing-tokens-do-not-expire-at-resource-server [IN] OBSERVATION
The `FencedResourceServer` has no TTL or expiration logic for tokens — once a token number is seen, all lower tokens are permanently rejected for that resource
- Source: entries/2026/05/29/topic-ddia-ch8-process-pauses.md

### find-conflicts-detects-concurrency [IN] OBSERVATION
`find_conflicts` returns `True` if and only if at least two `VersionedValue` entries in the input list have concurrent (mutually non-dominating) vector clocks
- Source: entries/2026/05/29/vector-clocks-test_vector_clock.md

### find-leaf-bisect-right-consistency [IN] OBSERVATION
_find_leaf, _search, and _insert all use bisect_right (not bisect_left) for child-pointer routing, ensuring all three methods agree on which leaf owns a given key; this is consistent with the split strategy where the separator key equals the first key of the right sibling.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-_find_leaf.md

### find-leaf-no-type-check [IN] OBSERVATION
_find_leaf does not verify that a page is an internal node before calling _deserialize_internal; a corrupted tree height or misidentified page type produces silently wrong results rather than an error.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-_find_leaf.md

### find-leaf-range-scan-only [IN] OBSERVATION
_find_leaf is called exclusively by range_scan; point lookups use _search instead, which reads the leaf inline and returns the value rather than the page number.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-_find_leaf.md

### first-committer-wins-on-write-conflict [IN] OBSERVATION
When two concurrent transactions write the same key, the first to call `commit()` succeeds and the second is aborted automatically
- Source: entries/2026/05/29/topic-mvcc-snapshot-isolation.md

### fixture-usage-is-sparse [IN] OBSERVATION
Only 3 of the `test_*.py` files use `@pytest.fixture` (`test_cdc`, `test_map_side_joins`, `test_bitcask`); most pytest files still use inline setup identical to their tester counterparts
- Source: entries/2026/05/29/topic-tester-vs-pytest-test-duality.md

### flush-before-fsync-invariant [IN] OBSERVATION
Every `os.fsync()` call in the codebase is immediately preceded by a `.flush()` call, ensuring Python's userspace buffer is drained to the kernel page cache before requesting stable storage persistence.
- Source: entries/2026/05/29/topic-fsync-vs-flush-semantics.md

### flush-clears-then-appends [IN] OBSERVATION
`_flush()` both clears `self._memtable` and appends to `self._sstables`, creating a window where data exists in neither location if a concurrent reader checks between the two mutations
- Source: entries/2026/05/28/topic-lsm-concurrency-safety.md

### flush-creates-sequential-sstables [IN] OBSERVATION
Each `_flush` call creates an SSTable file with a strictly increasing sequence number (zero-padded filename for lexicographic = numeric sort), preserving newest-last ordering in `self._sstables`
- Source: entries/2026/05/29/log-structured-merge-tree-lsm-_flush.md

### flush-index-drains-pending [IN] OBSERVATION
`flush_index()` is the sole mechanism that materializes async index updates: it iterates `self._pending`, calls `_apply_index_op()` for each entry, and returns the count of operations applied
- Source: entries/2026/05/29/topic-async-index-consistency.md

### flush-skips-immutable-memtable-stage [IN] OBSERVATION
`_flush` writes the frozen memtable directly to an SSTable without staging it in `_immutable_memtables`, creating a brief window where in-flight keys are invisible to `get()` despite `_immutable_memtables` being checked in the read path
- Source: entries/2026/05/29/log-structured-merge-tree-lsm-_flush.md

### force-true-bypasses-sync-mode [IN] OBSERVATION
Passing `force=True` to `_do_sync()` causes flush+fsync regardless of the configured sync mode, enabling group commit at batch boundaries
- Source: entries/2026/05/29/topic-fsync-semantics-by-mode.md

### force-true-used-at-rotation-and-close [IN] OBSERVATION
The only `_do_sync(force=True)` call sites (lines 165 and 175) correspond to segment rotation and WAL close, both operations where proceeding without a flush would risk data loss or file corruption
- Source: entries/2026/05/29/topic-sync-mode-none-safety.md

### format-and-structure-prevent-all-repair [IN] OBSERVATION
The system can neither self-heal during operation nor be evolved to add self-healing capabilities: runtime degradation is permanent (leaked pages accumulate, tree height only grows, no rebalancing occurs, crash recovery has no safe path), AND the rigid binary formats across the entire storage stack prevent adding recovery mechanisms such as resync points, version negotiation, or structural checksums.

### format-rigidity-prevents-evolutionary-repair [IN] OBSERVATION
The system cannot evolve its way out of known corruption vulnerabilities: the rigid binary format design across the entire storage stack prevents forward evolution and post-corruption recovery (no block alignment, no version fields, no extensibility), and the system simultaneously lacks defense-in-depth against the corruption these formats cannot recover from (no input validation, no resync capability).

### free-list-head-persisted-in-page-zero [IN] OBSERVATION
The free list head pointer is stored as the fifth field of the metadata page (page 0) in the layout `[root_page:4B][height:4B][total_keys:4B][next_free_page:4B][free_list_head:4B]`, making it durable across restarts.
- Source: entries/2026/05/29/topic-intrusive-free-list-pattern.md

### free-list-is-lifo [IN] OBSERVATION
The page free list is LIFO: `free_page` pushes to the head and `allocate_page` pops from the head, so the most recently freed page is reused first.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-free_page.md

### free-list-is-lifo-stack [IN] OBSERVATION
The B-tree's free page list is a LIFO stack: `free_page` pushes onto the head and `allocate_page` pops from the head, so pages are reused in reverse order of when they were freed.
- Source: entries/2026/05/29/topic-free-list-fragmentation.md

### free-list-layout-coupled [IN] OBSERVATION
`allocate_page` and `free_page` share an implicit contract on free-list node layout (3-byte zeroed header + 4-byte big-endian next-pointer); neither validates the other's output, so a format change in one silently breaks the other
- Source: entries/2026/05/28/b-tree-storage-engine-btree-allocate_page.md

### free-list-lifo-decorrelates-layout [IN] OBSERVATION
LIFO free-list reuse means delete-then-reinsert cycles progressively decorrelate logical key order from physical page order, defeating sequential I/O readahead for range scans.
- Source: entries/2026/05/29/topic-free-list-fragmentation.md

### free-page-only-called-from-delete [IN] OBSERVATION
`PageManager.free_page` is only invoked by `BTree._delete` when removing an emptied leaf node; no other code path frees pages.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-free_page.md

### free-page-overwrites-content [IN] OBSERVATION
`PageManager.free_page()` overwrites the freed page's content with a zero header and a free-list pointer, destroying the original leaf/internal node data so freed pages cannot be accidentally read as valid tree nodes.
- Source: entries/2026/05/29/topic-free-list-reuse-under-churn.md

### free-page-stores-only-seven-bytes [IN] OBSERVATION
Each free-list page occupies a full page (default 4096 bytes) but stores only 7 bytes of useful data (3-byte zero header + 4-byte next pointer); the remaining bytes are zero padding from `write_page`.
- Source: entries/2026/05/29/topic-free-list-fragmentation.md

### fresh-sstables-always-start-at-level-zero [IN] OBSERVATION
`SSTableWriter.finish()` hardcodes `level=0` (sstable.py:104), matching LevelDB's invariant that flushed memtables always produce L0 files
- Source: entries/2026/05/29/topic-leveldb-version-set.md

### fsync-policy-inconsistent-across-critical-paths [IN] OBSERVATION
The codebase has a systematic fsync policy inconsistency between components: WAL critical operations (checkpoints, batch commits, segment rotations) always force-fsync regardless of configured mode, but B-tree metadata mutations (which are equally critical for crash recovery) skip fsync entirely despite paying double fsync for user data pages.

### fsync-used-only-for-appends [IN] OBSERVATION
`os.fsync` appears in the WAL append path and Bitcask record writes but never in SSTable creation, compaction output, or hint file writes — durability is applied to the append hot path only
- Source: entries/2026/05/29/topic-write-to-temp-then-rename-pattern.md

### full-data-lifecycle-unsafe-from-write-through-read [IN] OBSERVATION
The complete data lifecycle is unsafe from storage maintenance through data retrieval: compaction is the highest-risk operation that can permanently lose or resurrect deleted data, and the read path from SSTable through CDC to derived systems is unreliable at every stage, meaning data is at risk whether it is being reorganized for efficiency or being served to consumers.

### full-stack-restart-fragility [IN] OBSERVATION
Both the physical storage layer and the logical transaction layer are independently fragile under restart, creating a full-stack restart hazard: the B-tree's durability model protects user data pages but not structural metadata (height, sibling chain, free list), while the transaction system's isolation model depends on monotonic counters and abort-as-status-change semantics that have no persistence backing.

### gc-no-active-keeps-one [IN] OBSERVATION
When no transactions are active, `garbage_collect()` retains at most one version per key — the latest committed non-deleted version — and drops everything else including fully-deleted keys.
- Source: entries/2026/05/29/snapshot-isolation-mvcc_database-garbage_collect.md

### gc-not-thread-safe [IN] OBSERVATION
`garbage_collect()` mutates `_versions` (replacing lists and deleting keys) without synchronization, assuming single-threaded execution — concurrent reads or writes during GC would race on the dict and its lists.
- Source: entries/2026/05/29/snapshot-isolation-mvcc_database-garbage_collect.md

### gc-preserves-active-snapshots [IN] OBSERVATION
Garbage collection removes only versions unreachable by any active transaction, ensuring long-running read-only transactions see consistent data
- Source: entries/2026/05/29/topic-mvcc-snapshot-isolation.md

### gc-unconditionally-purges-aborted [IN] OBSERVATION
`garbage_collect()` removes all versions with `created_by` in `_aborted` as its first step, regardless of active transaction state — aborted versions are invisible to everyone and always safe to drop.
- Source: entries/2026/05/29/snapshot-isolation-mvcc_database-garbage_collect.md

### gc-uses-min-txid-horizon [IN] OBSERVATION
With active transactions, `garbage_collect()` uses `min(active tx_ids)` as the GC horizon; only versions superseded by a committed version with `tx_id < min_tx_id` are eligible for removal.
- Source: entries/2026/05/29/snapshot-isolation-mvcc_database-garbage_collect.md

### get-pending-drains-atomically [IN] OBSERVATION
`get_pending_changes` returns the current `_pending` list and replaces it with an empty list, ensuring each change is delivered exactly once per replication cycle
- Source: entries/2026/05/29/topic-ring-topology-propagation.md

### gossip-bandwidth-scales-linearly-per-message [IN] OBSERVATION
Each gossip round transmits the entire membership list per exchange despite a fanout of only one peer per node, making per-message payload O(N) in cluster size and total per-round bandwidth O(N squared) across the cluster, even though only N point-to-point messages are sent.

### gossip-carries-membership-not-data [IN] OBSERVATION
`GossipNode.receive_gossip` merges heartbeat counters and node status (alive/suspected/dead) only; it never exchanges application key-value data, making it a SWIM-style failure detector rather than a data replication protocol
- Source: entries/2026/05/29/topic-dynamo-anti-entropy.md

### gossip-cleanup-bounds-membership-growth [IN] OBSERVATION
Dead nodes are removed from the membership list after `t_cleanup` elapsed time (default 20), preventing unbounded growth of the membership table from accumulated failure records
- Source: entries/2026/05/29/topic-swim-protocol.md

### gossip-cleanup-removes-entry-entirely [IN] OBSERVATION
After `t_cleanup` elapses, a dead node's record is fully deleted from membership (not just flagged), which enables clean rejoin with a fresh identity
- Source: entries/2026/05/29/gossip-protocol-test_gossip_protocol.md

### gossip-cluster-uses-logical-time [IN] OBSERVATION
The gossip simulation uses explicit logical timestamps passed as parameters (not wall-clock time), with a fixed RNG seed for deterministic gossip partner selection, following the same simulated-time pattern as hinted-handoff
- Source: entries/2026/05/29/gossip-protocol-test_gossip_protocol.md

### gossip-convergence-is-olog-n [IN] OBSERVATION
Full membership convergence occurs within O(log N) gossip rounds, empirically bounded by `5 * log₂(N) + 5` rounds in the test suite
- Source: entries/2026/05/29/gossip-protocol-test_gossip_protocol.md

### gossip-dead-node-lifecycle-is-comprehensive [IN] OBSERVATION
Dead nodes in the gossip protocol follow a comprehensive, irreversible lifecycle with three independent safeguards: death status cannot be reversed by incoming gossip messages, dead node records are fully removed from the membership list after the cleanup interval (not merely flagged), and incoming gossip about already-dead nodes from other peers is silently filtered to prevent zombie reintroduction through stale state.

### gossip-dead-nodes-filtered-on-receive [IN] OBSERVATION
When receiving gossip, unknown nodes that arrive with `dead` status are silently dropped to prevent zombie membership entries from propagating through the cluster
- Source: entries/2026/05/29/gossip-protocol-gossip_protocol.md

### gossip-death-is-irreversible-via-gossip [IN] OBSERVATION
Once a node's status is `dead`, receiving a gossip message with `status: alive` for that node will never revert it — the merge logic at line 67 checks `local["status"] != "dead"` before allowing exoneration
- Source: entries/2026/05/29/topic-swim-protocol.md

### gossip-deep-copy-isolation [IN] OBSERVATION
All inter-node data transfer (`send_gossip`, `join`, `get_membership_list`) uses `copy.deepcopy` to prevent shared mutable state between simulated nodes
- Source: entries/2026/05/29/gossip-protocol-gossip_protocol.md

### gossip-detect-failures-is-stateless [IN] OBSERVATION
`detect_failures` makes decisions using only `current_time - timestamp_last_updated`; it consults no historical distribution or sliding window, making it a pure point-in-time comparison rather than a statistical inference.
- Source: entries/2026/05/29/topic-phi-accrual-failure-detector.md

### gossip-failure-detection-gates-replication [IN] OBSERVATION
`GossipNode.get_alive_members` returns the liveness set that anti-entropy, read-repair, and hinted-handoff all depend on to decide which nodes to contact — gossip provides the membership layer that every data-exchange layer needs
- Source: entries/2026/05/29/topic-dynamo-anti-entropy.md

### gossip-failure-detection-governs-cluster-correctness [IN] OBSERVATION
Gossip-based failure detection is the single correctness bottleneck for the distributed cluster: replication, read repair, and hinted handoff all depend on its timeout-driven liveness set, which is bounded by cleanup to prevent unbounded membership growth.

### gossip-failure-detection-lacks-adaptive-accuracy [IN] OBSERVATION
Gossip failure detection is both the single correctness bottleneck for the cluster AND permanently miscalibrated: it governs all replication, read repair, and hinted handoff decisions, yet stores only the most recent timestamp per peer with no arrival history, making it unable to adapt detection thresholds to actual network conditions as phi-accrual detection would require.

### gossip-fanout-is-one [IN] OBSERVATION
Each node selects exactly one random peer per gossip round, with bidirectional exchange producing at most N pairwise syncs per round.
- Source: entries/2026/05/29/topic-topology-propagation-bounds.md

### gossip-fixed-three-threshold-fsm [IN] OBSERVATION
Failure detection uses exactly three fixed time thresholds (`t_suspect=5`, `t_dead=10`, `t_cleanup=20`) producing a deterministic state machine with no runtime adaptation to network conditions.
- Source: entries/2026/05/29/topic-phi-accrual-failure-detector.md

### gossip-merge-uses-heartbeat-monotonicity [IN] OBSERVATION
`receive_gossip` only accepts remote state when the remote heartbeat counter strictly exceeds the local counter, except for death notifications at equal counters which are also accepted
- Source: entries/2026/05/29/gossip-protocol-gossip_protocol.md

### gossip-no-arrival-history [IN] OBSERVATION
Each node stores only the most recent `timestamp_last_updated` per peer, discarding all inter-arrival time history that would be needed for probabilistic failure detection (e.g., phi accrual).
- Source: entries/2026/05/29/topic-phi-accrual-failure-detector.md

### gossip-node-status-lifecycle [IN] OBSERVATION
Node status follows `alive → suspected → dead → removed` with configurable timeouts `t_suspect`, `t_dead`, `t_cleanup` governing transitions; a suspected node can return to alive if a higher heartbeat counter arrives
- Source: entries/2026/05/29/gossip-protocol-gossip_protocol.md

### gossip-suspicion-is-timeout-based [IN] OBSERVATION
Suspicion in the gossip protocol is triggered by elapsed time exceeding `t_suspect` since the last heartbeat update, not by failed probe responses as in the full SWIM paper's ping/indirect-ping protocol
- Source: entries/2026/05/29/topic-swim-protocol.md

### gossip-topology-is-fully-connected [IN] OBSERVATION
GossipCluster uses unrestricted random peer selection (any node can gossip with any other), making the effective topology a fully-connected graph.
- Source: entries/2026/05/29/topic-topology-propagation-bounds.md

### gossip-uniform-thresholds [IN] OBSERVATION
All nodes in a `GossipCluster` receive identical threshold values from the cluster constructor at `__init__` (line 117); individual nodes cannot calibrate sensitivity to their specific network path.
- Source: entries/2026/05/29/topic-phi-accrual-failure-detector.md

### gossip-uses-full-membership-exchange [IN] OBSERVATION
Each gossip round transmits the entire membership list via `deepcopy` rather than deltas or piggybacked updates, making per-exchange bandwidth cost O(N) in cluster size instead of O(1) amortized
- Source: entries/2026/05/29/topic-swim-protocol.md

### gossip-voluntary-leave-broadcasts-to-all [IN] OBSERVATION
A leaving node sends its death status to every active peer (not just a random one) before deactivating, unlike normal gossip which uses random pairwise exchange
- Source: entries/2026/05/29/gossip-protocol-gossip_protocol.md

### handlers-must-mutate-state-in-place [IN] OBSERVATION
Both `reconstruct_state` and `Projection` expect handler functions to mutate the `state` dict in place rather than returning a new value; a handler that returns without mutating silently loses its changes
- Source: entries/2026/05/29/event-sourcing-store-event_store-reconstruct_state.md

### happens-before-requires-two-reaches-calls [IN] OBSERVATION
`happens_before` calls `_reaches` in both directions to distinguish "a before b", "b before a", and "concurrent" — a single call can only confirm or deny one direction.
- Source: entries/2026/05/29/lamport-clocks-lamport-_reaches.md

### hash-index-all-keys-in-memory [IN] OBSERVATION
Both Bitcask implementations keep every live key in an in-memory dict (`keydir` / `_index`), meaning the key set must fit in RAM; there is no disk-based fallback for partial index spill.
- Source: entries/2026/05/29/topic-ddia-chapter3-hash-indexes.md

### hash-index-bitcask-no-checksum [IN] OBSERVATION
The hash-index-storage Bitcask records contain no CRC or checksum field, unlike the log-structured-hash-table implementation; on-disk corruption is undetectable during reads or compaction.
- Source: entries/2026/05/29/hash-index-storage-bitcask-_write_record.md

### hash-index-bitcask-shared-read-handles [IN] OBSERVATION
`hash-index-storage/bitcask.py` uses a single cached file handle per segment for all reads via `_get_reader()`, making concurrent reads to the same segment unsafe due to shared seek position.
- Source: entries/2026/05/29/topic-reference-counted-file-handles.md

### hash-index-compaction-manual-only [IN] OBSERVATION
The `hash-index-storage/bitcask.py` `compact()` must be called explicitly by the caller; there is no auto-compact trigger, threshold tracking, or background compaction unlike the `log-structured-hash-table` variant
- Source: entries/2026/05/29/topic-bitcask-paper-comparison.md

### hash-index-get-tombstone-guard [IN] OBSERVATION
`hash-index-storage/bitcask.py`'s `get()` checks `if value == "": return None`, providing a defense-in-depth layer against tombstone leaks that `log-structured-hash-table`'s `get()` lacks
- Source: entries/2026/05/29/topic-tombstone-handling-in-hint-files.md

### hash-index-hint-self-sufficient [IN] OBSERVATION
The `hash-index-storage` hint file contains all four keydir fields (file_id, offset, size, timestamp), making it sufficient to fully reconstruct the in-memory index without reading data files
- Source: entries/2026/05/29/topic-bitcask-hint-file-format.md

### hash-index-is-memory-bound-by-design [IN] OBSERVATION
Hash index storage is fundamentally memory-bound: every key must reside in RAM for the single-seek O(1) read path, making dataset size directly constrained by available memory with no spill-to-disk fallback.

### hash-index-keys-must-fit-in-ram [IN] OBSERVATION
Both Bitcask implementations require every live key to be held in a Python dict in memory; the dataset's key space is bounded by available RAM
- Source: entries/2026/05/29/topic-bitcask-vs-lsm-tree.md

### hash-index-no-crc-by-design [IN] OBSERVATION
`hash-index-storage/bitcask.py` intentionally omits CRC to focus on keydir/append-log/compaction; `log-structured-hash-table/bitcask.py` provides the integrity-checking variant with `zlib.crc32` and `CorruptionError`
- Source: entries/2026/05/29/topic-hash-index-bitcask-no-crc.md

### hash-index-no-range-queries [IN] OBSERVATION
Neither hash index implementation (`hash-index-storage/bitcask.py` or `log-structured-hash-table/bitcask.py`) provides a `scan`, `range`, or ordered-iteration method; only exact-key point lookups are supported.
- Source: entries/2026/05/29/topic-ddia-chapter3-hash-indexes.md

### hash-index-read-is-single-seek [IN] OBSERVATION
A Bitcask `get()` does one dict lookup plus one positioned disk read (O(1)), while an LSM-tree `get()` may search the memtable then multiple SSTables from newest to oldest (O(log N) per level)
- Source: entries/2026/05/29/topic-bitcask-vs-lsm-tree.md

### hash-index-sync-writes-controls-fsync [IN] OBSERVATION
In hash-index-storage Bitcask, the `sync_writes` flag controls whether each `_write_record` call fsyncs to disk; when `False`, recently written records may be lost on crash.
- Source: entries/2026/05/29/hash-index-storage-bitcask-_write_record.md

### hash-index-wall-clock-not-monotonic [IN] OBSERVATION
`_write_record` timestamps use `time.time()`, which is not monotonic — NTP adjustments or manual clock changes can cause a newer write to carry an older timestamp, confusing compaction's latest-wins conflict resolution.
- Source: entries/2026/05/29/hash-index-storage-bitcask-_write_record.md

### hash-index-write-no-index-mutation [IN] OBSERVATION
`_write_record` in `hash-index-storage/bitcask.py` does not modify `keydir`; the caller (`put` or `delete`) is responsible for updating the in-memory index after the write returns.
- Source: entries/2026/05/29/hash-index-storage-bitcask-_write_record.md

### hash-mod-destroys-key-order [IN] OBSERVATION
Hash-mod partitioning (`hash(k) % num_reducers`) scatters lexicographically adjacent keys across different partitions, making range queries require a full scatter-gather across all reducers; MapReduce `run()` re-sorts the final results to compensate
- Source: entries/2026/05/29/topic-hash-partitioning-skew.md

### hash-mod-partition-count-is-static [IN] OBSERVATION
In `mapreduce.py`, `num_reducers` is fixed at job creation (`num_reducers: int = 2`) and never changes during execution; all partition assignments are determined by `hash(k) % num_reducers` with no dynamic rebalancing
- Source: entries/2026/05/29/topic-hash-partitioning-skew.md

### hash-mod-skew-shared-across-modules [IN] OBSERVATION
Three modules use the same `hash(k) % num_partitions` pattern with fixed partition counts and no hot-key mitigation: `mapreduce.py:109`, `secondary_index_partitioning.py:56`, and `partitioned_log.py:149` — all share the straggler vulnerability where a single hot key overwhelms one partition
- Source: entries/2026/05/29/topic-hash-partitioning-skew.md

### hashlib-reserved-for-crypto-distribution [IN] OBSERVATION
The codebase consistently uses `zlib.crc32` for corruption detection in storage engines and `hashlib` (SHA-256) for content addressing (Merkle trees), uniform distribution (consistent hashing, bloom filters), and cryptographic integrity (BFT) — the two hash families never cross purposes
- Source: entries/2026/05/29/topic-crc32-vs-xxhash-choice.md

### header-corruption-derails-sequential-scan [IN] OBSERVATION
A corrupted `key_size` or `val_size` in `hash-index-storage/bitcask.py` causes `_scan_data_file` to read wrong byte boundaries, potentially misinterpreting all subsequent records in that file
- Source: entries/2026/05/29/topic-hash-index-bitcask-no-crc.md

### heap-tiebreak-uses-source-index [IN] OBSERVATION
The merge heap tuple includes a `source_index` field solely to prevent Python from comparing `SSTableEntry` objects when `(key, -timestamp)` ties, which would raise `TypeError`.
- Source: entries/2026/05/28/sstable-and-compaction-sstable-merge_sstables.md

### hint-corrupt-worse-than-missing [IN] OBSERVATION
A corrupted hint file silently loads bad offsets into the index with no validation; a missing hint file triggers a CRC-checked full scan via `_scan_segment`, making hint corruption strictly worse than hint absence.
- Source: entries/2026/05/29/log-structured-hash-table-bitcask-_load_hint_file.md

### hint-expiry-bounds-durability [IN] OBSERVATION
Hints in `HintedHandoffStore` have a TTL; if `trigger_handoff` doesn't run before `created_at + hint_ttl`, `expire_hints` silently drops the hint, permanently losing that replica's copy of the data
- Source: entries/2026/05/29/topic-sloppy-quorum-tradeoffs.md

### hint-file-no-fsync [IN] OBSERVATION
Hint files are written without `fsync`; they are a best-effort startup optimization that can be lost on crash without data loss, since the data file remains authoritative.
- Source: entries/2026/05/28/hash-index-storage-bitcask-_write_hint_file.md

### hint-file-no-integrity-check [IN] OBSERVATION
Neither the hint file writer nor reader performs checksum validation; a truncated or corrupted hint file causes a `struct.error` or `IndexError` on startup rather than a graceful fallback to data file scanning.
- Source: entries/2026/05/28/hash-index-storage-bitcask-_write_hint_file.md

### hint-file-orphan-crash-risk [IN] OBSERVATION
In `hash-index-storage/bitcask.py`, a crash between deleting a data file and its hint file leaves an orphaned hint that directs reads to a nonexistent data file, causing hard failures on `get()`.
- Source: entries/2026/05/28/topic-bitcask-compaction-atomicity.md

### hint-file-write-not-atomic [IN] OBSERVATION
Hint files are written directly to the final path (not via temp-file-and-rename), so a crash mid-generation leaves a partial hint file; `_load_hint_file` handles this by stopping at the first short read
- Source: entries/2026/05/29/log-structured-hash-table-bitcask-create_hint_files.md

### hint-files-are-optimization-only [IN] OBSERVATION
Hint files accelerate index rebuild during recovery but carry no crash-consistency guarantees; they are not consulted to determine which data files constitute the valid current state
- Source: entries/2026/05/28/topic-bitcask-crash-recovery.md

### hint-files-only-canonical-entries [IN] OBSERVATION
A hint entry is written only when `self._index[key]` confirms the key's canonical location is the current segment and offset, filtering out entries superseded by later writes or compaction
- Source: entries/2026/05/29/log-structured-hash-table-bitcask-create_hint_files.md

### hint-format-diverges-between-implementations [IN] OBSERVATION
Hash-index hint uses a 24-byte fixed header per entry (`file_id:u32, offset:u64, size:u32, timestamp:f64`) while log-structured hint uses only an 8-byte header (`key_size:u32, offset:u32`), omitting file_id, timestamp, and record size
- Source: entries/2026/05/29/topic-bitcask-startup-cost.md

### hint-format-symmetry [IN] OBSERVATION
`_write_hint_file` and `_load_hint_file` must use identical binary layouts: `[HINT_FORMAT header (24 bytes)][key_len uint32 (4 bytes)][key_bytes (variable)]` per entry; any change to one without the other breaks backward compatibility.
- Source: entries/2026/05/28/hash-index-storage-bitcask-_write_hint_file.md

### hint-format-variable-length [IN] OBSERVATION
Each hint record is `HINT_HEADER_SIZE (24 bytes) + 4 (key_size uint32) + key_length` bytes; the `HINT_FORMAT` constant covers only the fixed 24-byte portion, not the full record.
- Source: entries/2026/05/29/hash-index-storage-bitcask-_load_hint_file.md

### hint-generation-no-crc-validation [IN] OBSERVATION
`create_hint_files` does not verify CRC checksums on the segment records it reads (unlike `_scan_segment`), so corrupted data can be silently indexed via hint files
- Source: entries/2026/05/29/log-structured-hash-table-bitcask-create_hint_files.md

### hint-generation-skips-active-segment [IN] OBSERVATION
`create_hint_files` never generates a hint file for the active segment, only for frozen (immutable) segments, since a hint for the active segment would be immediately stale
- Source: entries/2026/05/29/log-structured-hash-table-bitcask-create_hint_files.md

### hint-offset-is-segment-offset [IN] OBSERVATION
The `offset` value stored in a hint entry is a byte position in the corresponding `.dat` segment file, not a position within the hint file itself; `get()` uses this offset to seek directly into the segment.
- Source: entries/2026/05/29/log-structured-hash-table-bitcask-_load_hint_file.md

### hint-written-only-during-compaction [IN] OBSERVATION
`_write_hint_file` is called exclusively from `compact()`, never during normal `put`/`delete` operations; hint files only exist for compacted (merged) data files.
- Source: entries/2026/05/28/hash-index-storage-bitcask-_write_hint_file.md

### hinted-handoff-hint-cleanup-all-or-nothing [IN] OBSERVATION
`remove_hints_for()` purges all hints for a recovered target node regardless of whether individual hints were successfully delivered or had expired; after `trigger_handoff`, no hints for that node remain on any other node.
- Source: entries/2026/05/29/hinted-handoff-hinted_handoff.md

### hinted-handoff-hint-loss-is-silent-data-loss [IN] OBSERVATION
If a hint node crashes and its hints are lost before handoff, the target node never receives the data and no error is raised — requiring anti-entropy repair (e.g., Merkle trees) to detect and fix.
- Source: entries/2026/05/29/hinted-handoff-test_hinted_handoff.md

### hinted-handoff-hint-ttl-exclusive-boundary [IN] OBSERVATION
Hint expiration uses exclusive comparison: a hint created at time `t` with TTL `d` survives at `t + d - 1` but is expired at `t + d`.
- Source: entries/2026/05/29/hinted-handoff-test_hinted_handoff.md

### hinted-handoff-no-exceptions [IN] OBSERVATION
Write failures are signaled via `success: False` in the return dict; the hinted handoff module never raises exceptions in normal operation.
- Source: entries/2026/05/29/hinted-handoff-hinted_handoff.md

### hinted-handoff-one-hint-per-target-per-write [IN] OBSERVATION
Each unavailable preferred replica gets at most one hint stored on one non-preferred node per write operation, preventing hint explosion.
- Source: entries/2026/05/29/hinted-handoff-hinted_handoff.md

### hinted-handoff-reads-skip-non-preferred [IN] OBSERVATION
`get()` only queries preferred replicas, so data existing only as hints on non-preferred nodes is invisible to reads until handoff delivers it.
- Source: entries/2026/05/29/hinted-handoff-hinted_handoff.md

### hinted-handoff-sloppy-quorum-counts-hints [IN] OBSERVATION
When `sloppy_quorum=True` (the default), stored hints count toward `write_quorum` the same as direct replica writes, trading consistency for write availability.
- Source: entries/2026/05/29/hinted-handoff-hinted_handoff.md

### hinted-handoff-sloppy-quorum-writes-to-non-preferred [IN] OBSERVATION
When sloppy quorum is enabled and preferred replicas are down, writes succeed by routing to non-preferred substitute nodes that store hints for later delivery.
- Source: entries/2026/05/29/hinted-handoff-test_hinted_handoff.md

### hinted-handoff-strict-quorum-rejects-insufficient-preferred [IN] OBSERVATION
With `sloppy_quorum=False`, a write fails if fewer than `write_quorum` preferred nodes are available, even if non-preferred nodes could serve.
- Source: entries/2026/05/29/hinted-handoff-test_hinted_handoff.md

### hinted-handoff-time-is-parameter [IN] OBSERVATION
All time-dependent operations (`put`, `trigger_handoff`, `expire_all_hints`) take `current_time` as an explicit argument rather than calling `time.time()`, making the implementation fully deterministic and testable.
- Source: entries/2026/05/29/hinted-handoff-hinted_handoff.md

### hinted-handoff-ttl-bounds-hint-lifetime [IN] OBSERVATION
Hints expire after `created_at + ttl` (checked via `Hint.is_expired`); a node that stays down longer than the TTL window will not receive those hinted writes and must rely on anti-entropy for convergence.
- Source: entries/2026/05/29/topic-anti-entropy-vs-read-repair.md

### hinted-handoff-version-monotonic-per-key [IN] OBSERVATION
The coordinator assigns strictly increasing versions per key via a centralized counter, so stale hint replays are absorbed harmlessly by `Node.put()`'s `version >= existing_version` check.
- Source: entries/2026/05/29/hinted-handoff-hinted_handoff.md

### index-is-volatile [IN] OBSERVATION
All three storage engines rebuild their in-memory index (`keydir`, `_index`) entirely from on-disk files at startup; inconsistent file state after a crash produces a silently incomplete index with no error
- Source: entries/2026/05/28/topic-bitcask-crash-recovery.md

### init-meta-unprotected-by-wal [IN] OBSERVATION
During initial file creation, `PageManager.__init__` writes metadata and the root leaf page directly via `_write_meta` without WAL protection; acceptable because no user data exists yet to lose.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-_write_meta.md

### input-validation-systematically-absent [IN] OBSERVATION
Input validation is systematically absent across configuration and API boundaries: the WAL accepts arbitrary strings as sync mode (silently disabling all durability guarantees), truncate accepts out-of-range sequence numbers (silently deleting all data), compaction strategy selection silently falls through to leveled on any unrecognized string, and Merkle proof direction is never validated — misconfiguration produces incorrect behavior rather than failing fast.

### integrity-checking-is-both-incomplete-and-unrecoverable [IN] OBSERVATION
Data integrity checking fails in two complementary ways: CRC checksums exclude routing metadata (sequence numbers, page numbers, headers), so corruption in those fields goes undetected; and every reader halts at the first detected CRC failure with no resync capability — meaning both detectable and undetectable corruption are terminal.

### integrity-degrades-along-storage-pipeline [IN] OBSERVATION
Data integrity verification degrades monotonically along the storage pipeline: WAL records have CRC checksums that exclude routing metadata (partial coverage), SSTables have no checksums at all (zero coverage), and neither layer can recover from detected corruption — data transitions from partially verified to completely unverified as it moves through compaction from WAL to SSTable.

### internal-node-children-invariant [IN] OBSERVATION
_serialize_internal assumes len(children) == len(keys) + 1 but does not assert it; too few children raises IndexError, while extra children beyond keys+1 are silently dropped — a potential data-loss path if the invariant is violated.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-_serialize_internal.md

### internal-nodes-no-sibling-pointer [IN] OBSERVATION
Internal nodes have no sibling pointer field; their binary layout is `[type:1B][num_keys:2B][child₀:4B]([keylen:2B][key][child:4B])...` with no right-link, and `_deserialize_internal` returns a 2-tuple `(keys, children)` vs. the leaf's 3-tuple `(keys, values, next_sibling)`.
- Source: entries/2026/05/29/topic-internal-node-format.md

### internal-page-binary-layout [IN] OBSERVATION
Internal page wire format is [type:1B][num_keys:2B][child0:4B] followed by repeating [keylen:2B][key:varB][child:4B], all big-endian with no alignment padding; this interleaved layout mirrors the logical separator-pointer structure.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-_serialize_internal.md

### iter-descends-left-spine [IN] OBSERVATION
`__iter__` reaches the leftmost leaf by following `children[0]` at every internal level, then walks the sibling chain to `NO_SIBLING` to enumerate all keys in sorted order
- Source: entries/2026/05/29/topic-split-during-insert.md

### iter-no-counter-reset [IN] OBSERVATION
Unlike get, put, delete, and range_scan, BTree.__iter__ does not call reset_counters() before running, so I/O page-read stats accumulate on top of any prior count — a gotcha when benchmarking iteration.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-__iter__.md

### iterate-flushes-without-fsync [IN] OBSERVATION
Before reading, `iterate()` calls `fd.flush()` but not `os.fsync()`, so buffered writes reach the OS page cache but are not guaranteed durable on stable storage.
- Source: entries/2026/05/28/write-ahead-log-wal-iterate.md

### iterate-not-snapshot-isolated [IN] OBSERVATION
`iterate()` releases the WAL lock after flushing but before reading begins, so concurrent appends may or may not be visible to the iterator depending on timing.
- Source: entries/2026/05/28/write-ahead-log-wal-iterate.md

### iterate-stops-silently-at-corruption [IN] OBSERVATION
When `iterate()` encounters a CRC mismatch, iteration stops with no exception or sentinel value — the caller receives all valid records before the corruption with no indication that more data existed.
- Source: entries/2026/05/28/write-ahead-log-wal-iterate.md

### iterate-yields-all-op-types [IN] OBSERVATION
`iterate()` yields records of every op type (PUT, DELETE, COMMIT, CHECKPOINT) while `replay()` filters to only PUT and DELETE records.
- Source: entries/2026/05/28/write-ahead-log-wal-iterate.md

### join-determinism-from-event-time [IN] OBSERVATION
The stream join processor's output is fully determined by the sequence of `(stream_name, key, value, timestamp)` inputs and `advance_time` calls, with no dependency on wall-clock time — making it deterministically testable without clock mocking
- Source: entries/2026/05/29/topic-watermark-vs-processing-time.md

### join-uses-symmetric-interval-not-tumbling [IN] OBSERVATION
`TimeWindow.contains()` uses `abs(t1 - t2) <= duration`, making join matching a symmetric interval check centered on each event, not aligned to any fixed time grid
- Source: entries/2026/05/29/topic-sliding-vs-tumbling-windows.md

### join-window-and-aggregation-window-are-independent [IN] OBSERVATION
The join `TimeWindow` duration controls which event pairs can match, while the `TumblingWindowAggregator` window size controls result grouping; these are independent parameters that can differ without violating any invariant
- Source: entries/2026/05/29/topic-sliding-vs-tumbling-windows.md

### key-range-merkle-diff-returns-key-names [IN] OBSERVATION
`KeyRangeMerkleTree.diff_keys()` returns the string key names of divergent entries, translating numeric leaf indices back to keys
- Source: entries/2026/05/29/merkle-tree-test_merkle.md

### key-range-merkle-sorts-by-key [IN] OBSERVATION
`KeyRangeMerkleTree` sorts input pairs by key before building, so two trees with identical key-value content always produce the same root hash regardless of insertion order
- Source: entries/2026/05/29/merkle-tree-merkle_tree.md

### lamport-happens-before-returns-tristate [IN] OBSERVATION
`happens_before(a, b, events)` returns `True` (a causes b), `False` (b causes a), or `None` (concurrent) — modeling a strict partial order with explicit concurrency detection.
- Source: entries/2026/05/29/lamport-clocks-test_lamport.md

### lamport-happens-before-uses-graph-not-timestamps [IN] OBSERVATION
`happens_before()` determines causality by BFS over `_parent`/`_cause` edges in the event DAG, not by comparing timestamps; the `all_events` parameter is accepted but never used.
- Source: entries/2026/05/29/lamport-clocks-lamport.md

### lamport-mutex-lowest-timestamp-priority [IN] OBSERVATION
`LamportMutex` grants critical section access to the requesting node with the lowest timestamp, with other requesters waiting until the holder releases.
- Source: entries/2026/05/29/lamport-clocks-test_lamport.md

### lamport-mutex-requires-all-acks [IN] OBSERVATION
`can_enter()` returns `True` only when the node's request is at the queue head (lowest `(timestamp, node_id)`) AND acknowledgments have been received from every other node in the system.
- Source: entries/2026/05/29/lamport-clocks-lamport.md

### lamport-receive-tick-guarantees-causal-order [IN] OBSERVATION
`receive_tick` computes `max(local, received) + 1`, ensuring every receive event has a timestamp strictly greater than the corresponding send event, which is the core Lamport clock invariant.
- Source: entries/2026/05/29/lamport-clocks-lamport.md

### lamport-receive-tick-max-merge [IN] OBSERVATION
`LamportClock.receive_tick(remote_ts)` computes `max(local, remote_ts) + 1`, ensuring the clock advances past both the local and remote state on every receive.
- Source: entries/2026/05/29/lamport-clocks-test_lamport.md

### lamport-send-creates-receive [IN] OBSERVATION
`Node.send_message(target, payload)` has a side effect on the target node: it appends a RECEIVE event to the target's event log and advances the target's clock.
- Source: entries/2026/05/29/lamport-clocks-test_lamport.md

### lamport-send-delivers-synchronously [IN] OBSERVATION
`send_message` calls `to_node.receive_message()` directly in the same call stack, making message delivery instantaneous and deterministic with no async queue or network simulation.
- Source: entries/2026/05/29/lamport-clocks-lamport.md

### lamport-timestamp-order-not-implies-causality [IN] OBSERVATION
`a.timestamp < b.timestamp` does NOT imply a→b; two concurrent events can have ordered timestamps, which is a fundamental limitation of Lamport clocks that vector clocks address.
- Source: entries/2026/05/29/lamport-clocks-lamport.md

### lamport-total-order-breaks-ties-by-node-id [IN] OBSERVATION
`total_order()` sorts by `(timestamp, node_id)`, using lexicographic node ID comparison as a deterministic tiebreaker when timestamps are equal.
- Source: entries/2026/05/29/lamport-clocks-lamport.md

### lamport-total-order-tiebreak-by-node-id [IN] OBSERVATION
`total_order()` sorts events by `(timestamp, node_id)`, using the node identifier as a deterministic tiebreaker for concurrent events with equal timestamps.
- Source: entries/2026/05/29/lamport-clocks-test_lamport.md

### late-event-drop-is-hard-cutoff [IN] OBSERVATION
Events with `timestamp < watermark - allowed_lateness` are unconditionally dropped and counted in `stats.late_events_dropped`; there is no secondary path to recover or re-buffer them
- Source: entries/2026/05/29/topic-watermark-vs-processing-time.md

### lcs-compact-one-per-call [IN] OBSERVATION
`_lcs_compact` performs at most one merge operation per invocation and returns immediately; cascading compactions across levels require the caller to loop on `run_compaction()`.
- Source: entries/2026/05/29/sstable-and-compaction-sstable-_lcs_compact.md

### lcs-l0-compacts-all [IN] OBSERVATION
L0 compaction always includes every L0 SSTable in the merge set (not just a subset), which can cause large write spikes when many L0 SSTables accumulate before the trigger fires.
- Source: entries/2026/05/29/sstable-and-compaction-sstable-_lcs_compact.md

### lcs-l0-triggers-on-count-not-size [IN] OBSERVATION
Level 0 in leveled compaction triggers based on SSTable count (`l0_compaction_trigger`), not total size, because L0 SSTables can have overlapping key ranges unlike higher levels which maintain non-overlapping invariants
- Source: entries/2026/05/29/topic-stcs-vs-lcs-tradeoffs.md

### lcs-level-size-exponential [IN] OBSERVATION
Level size budgets grow as `base_size × fanout^(level-1)`, so with defaults of 10MB base and 10× fanout: L1=10MB, L2=100MB, L3=1GB, etc.
- Source: entries/2026/05/29/sstable-and-compaction-sstable-_lcs_compact.md

### lcs-levels-grow-exponentially [IN] OBSERVATION
Leveled compaction sizes each level as `level_base_size * fanout^(level-1)` with a 10MB default base and 7 max levels, so each level is 10x the previous by default
- Source: entries/2026/05/29/topic-stcs-vs-lcs-tradeoffs.md

### lcs-overlap-selection-is-quadratic [IN] OBSERVATION
During level N compaction, `_lcs_compact` calls `_overlapping` once per SSTable in the level to find the one with maximum overlap, then once more for the winner — making SSTable selection O(n*m) in level sizes.
- Source: entries/2026/05/29/sstable-and-compaction-sstable-_overlapping.md

### lcs-picks-max-overlap-sstable [IN] OBSERVATION
For level N→N+1 compaction, the SSTable with the most overlapping neighbors in the next level is chosen — the opposite of LevelDB/RocksDB's least-overlap heuristic, increasing write amplification per compaction.
- Source: entries/2026/05/29/sstable-and-compaction-sstable-CompactionManager.md

### lcs-produces-single-output-sstable [IN] OBSERVATION
Leveled compaction merges all inputs into one SSTable rather than splitting output by size, which means levels above L0 will contain overlapping key ranges after compaction — violating the real LCS invariant that each level has non-overlapping SSTables.
- Source: entries/2026/05/29/sstable-and-compaction-sstable-CompactionManager.md

### leaf-chain-correct-single-threaded [IN] OBSERVATION
The `next_sibling` leaf chain is correctly maintained through splits because the WAL logs the right page before the left page (ensuring the pointer target is durable before anything references it) and no concurrent reader can observe intermediate states
- Source: entries/2026/05/29/topic-concurrency-and-latch-coupling.md

### leaf-entry-cost-is-variable [IN] OBSERVATION
Each leaf entry costs `4 + len(key_bytes) + len(value_bytes)` bytes (2B key length + key + 2B value length + value), so maximum keys per page depends on actual data sizes, not just a fixed `max_keys` count.
- Source: entries/2026/05/29/topic-page-overflow-and-split-mechanics.md

### leaf-nodes-store-all-values [IN] OBSERVATION
Values are stored exclusively in leaf nodes; internal nodes contain only routing keys and child page pointers, making this a B+-tree not a B-tree
- Source: entries/2026/05/29/topic-b-plus-tree-leaf-chains.md

### leaf-serialized-overhead-is-7-bytes [IN] OBSERVATION
Every leaf page has a fixed 7-byte overhead (3B header via `HEADER_FMT = '>BH'` + 4B `next_sibling` pointer) before any key-value entries, setting the usable capacity to `page_size - 7` bytes.
- Source: entries/2026/05/29/topic-page-overflow-and-split-mechanics.md

### leaf-sibling-not-blink [IN] OBSERVATION
Leaf `next_sibling` pointers serve range scans only, not Lehman & Yao concurrent-split recovery; internal nodes have no right-links, making this a standard B+ tree, not a B-link tree
- Source: entries/2026/05/29/topic-lehman-yao-1981.md

### leaf-split-updates-sibling-pointers [IN] OBSERVATION
When a leaf splits, the new page inherits the old page's `next_sibling` pointer and the old page's `next_sibling` is updated to point to the new page — two field writes during an operation that already rewrites both pages
- Source: entries/2026/05/29/topic-leaf-sibling-chains.md

### leaf-wire-format-is-header-sibling-entries [IN] OBSERVATION
Leaf pages are laid out as `[type:1B][num_keys:2B][next_sibling:4B][entries...]` where each entry is `[key_len:2B][key][val_len:2B][val]`, all big-endian
- Source: entries/2026/05/28/b-tree-storage-engine-btree-_serialize_leaf.md

### lehman-yao-needs-high-keys-and-internal-links [IN] OBSERVATION
The existing leaf `next_sibling` pointer is necessary but insufficient for Lehman-Yao concurrent access; the algorithm also requires right-links on internal nodes and a high key per node, neither of which this implementation has
- Source: entries/2026/05/29/topic-concurrency-and-latch-coupling.md

### leveled-compaction-promotes-to-l1 [IN] OBSERVATION
After leveled compaction runs on L0 SSTables, the output readers have `level == 1` set by the caller.
- Source: entries/2026/05/28/sstable-and-compaction-test_sstable.md

### leveled-compaction-promotes-to-next-level [IN] OBSERVATION
After leveled compaction, resulting SSTables are assigned `level = max_level + 1` where `max_level` is the highest level among inputs, establishing the level hierarchy (observable in test assertion at `test_sstable.py:119`).
- Source: entries/2026/05/29/topic-leveled-compaction-write-amplification.md

### leveled-triggers-on-l0-count [IN] OBSERVATION
Leveled compaction triggers when the L0 SSTable count meets the `l0_compaction_trigger` parameter
- Source: entries/2026/05/29/sstable-and-compaction-tester_test_sstable.md

### linearizable-read-requires-consensus [IN] OBSERVATION
`LinearizableRegister.read()` broadcasts through the TOB consensus layer rather than reading local state, because a node cannot distinguish "I have the latest value" from "I'm partitioned and stale" without majority confirmation.
- Source: entries/2026/05/29/topic-linearizable-reads-via-tob.md

### live-projection-exactly-once-via-position-guard [IN] OBSERVATION
Live projections achieve exactly-once processing semantics through three reinforcing mechanisms: position-guarded deduplication prevents double-processing during overlapping catch-up and subscription, position advances even for events with no registered handler (preventing future re-evaluation), and the catch-up-then-subscribe pattern eliminates gaps between historical replay and live events.

### live-projection-full-replay-on-restart [IN] OBSERVATION
A `LiveProjection` must replay all events from position 0 on initialization because there is no snapshot restore path wired into its construction
- Source: entries/2026/05/29/topic-projection-snapshot-interaction.md

### live-projection-idempotent-by-position [IN] OBSERVATION
`LiveProjection._on_event` guards against double-processing via `event_id <= _position`, making it safe to overlap `catch_up()` and live subscription without duplicate handler invocations.
- Source: entries/2026/05/29/event-sourcing-store-event_store-LiveProjection.md

### live-projection-is-synchronous [IN] OBSERVATION
`LiveProjection` processes events synchronously inside the `append()` call path via the `_subscribers` mechanism, meaning the caller blocks until all live projections have updated
- Source: entries/2026/05/29/event-sourcing-store-event_store.md

### live-projection-no-error-boundary [IN] OBSERVATION
Handler exceptions in `LiveProjection._on_event` propagate through `EventStore.append()` to the caller; the projection's `_position` is not updated, leaving it behind but recoverable via `catch_up()`.
- Source: entries/2026/05/29/event-sourcing-store-event_store-LiveProjection.md

### live-projection-no-snapshot-interval [IN] OBSERVATION
`LiveProjection` is never constructed with a `snapshot_interval`, so the automatic snapshot path in `catch_up()` never triggers for live projections
- Source: entries/2026/05/29/topic-projection-snapshot-interaction.md

### live-projection-no-unsubscribe [IN] OBSERVATION
Once constructed, a `LiveProjection` cannot be unsubscribed from its store; the callback reference persists in `_subscribers` indefinitely with no removal mechanism.
- Source: entries/2026/05/29/event-sourcing-store-event_store-LiveProjection.md

### live-projection-position-advances-without-handler [IN] OBSERVATION
Events with no registered handler still advance `LiveProjection._position`, preventing re-evaluation on subsequent `catch_up()` calls — position tracks last *seen* event, not last *handled* event.
- Source: entries/2026/05/29/event-sourcing-store-event_store-LiveProjection.md

### live-projection-subscribes-before-catchup [IN] OBSERVATION
`LiveProjection.__init__` registers the subscriber before the user calls `catch_up()`, making duplicate processing possible but event loss impossible in the current single-threaded design
- Source: entries/2026/05/29/topic-catch-up-subscription-gap.md

### live-projection-uses-subscriber-list [IN] OBSERVATION
`LiveProjection` receives automatic updates by registering a callback in `EventStore._subscribers` (a list of `Callable[[Event], None]]`), which is invoked synchronously inside `append()` and `append_batch()`
- Source: entries/2026/05/29/topic-live-projection-subscription-mechanism.md

### local-writes-self-immunize [IN] OBSERVATION
Both `put` and `delete` call `_record_seen` with the local node's ID, ensuring the originating node drops its own changes when they loop back through the ring
- Source: entries/2026/05/29/topic-ring-topology-propagation.md

### lock-expiry-enables-reacquisition-without-release [IN] OBSERVATION
`LockService.acquire` checks `is_expired(current_time)` before rejecting a competing acquire, so a crashed or GC-paused client's lock is automatically reclaimable after TTL without requiring explicit release
- Source: entries/2026/05/29/topic-ddia-ch8-process-pauses.md

### locks-block-future-transactions [IN] OBSERVATION
A participant in the `"prepared"` state holds key-level locks (`self.locks[key] = tx_id`) that cause any subsequent transaction touching the same keys to abort with a lock conflict during its own `prepare()`.
- Source: entries/2026/05/29/topic-2pc-blocking-problem.md

### log-structured-bitcask-no-fsync [IN] OBSERVATION
`log-structured-hash-table/bitcask.py` uses only `flush()` without `os.fsync()` on data writes, unlike `hash-index-storage` which calls `os.fsync` per record when `sync_writes` is enabled
- Source: entries/2026/05/29/topic-bitcask-paper-design.md

### log-structured-compact-closes-active-file [IN] OBSERVATION
In `log-structured-hash-table/bitcask.py`, compaction closes `_active_file` during its write phase, making all reads and writes to the active segment impossible until the active file is reopened under a new name at the end of compaction
- Source: entries/2026/05/29/topic-concurrent-merge-safety.md

### log-structured-hint-creation-decoupled [IN] OBSERVATION
The log-structured Bitcask exposes `create_hint_files()` as a standalone operation callable independently of compaction, while the hash-index variant only writes hint files inside `compact()`
- Source: entries/2026/05/29/topic-bitcask-startup-cost.md

### log-structured-hint-creation-is-manual [IN] OBSERVATION
In `log-structured-hash-table/bitcask.py`, hint file creation requires an explicit `create_hint_files()` call rather than being produced automatically as a side-effect of compaction
- Source: entries/2026/05/29/topic-bitcask-paper-design.md

### log-structured-hint-offset-32bit-cap [IN] OBSERVATION
Hint entry format uses `!II` (two `u32` network-order integers), capping segment file offsets at 4 GiB regardless of `max_segment_size` configuration.
- Source: entries/2026/05/29/log-structured-hash-table-bitcask-_load_hint_file.md

### log-structured-index-omits-size-and-timestamp [IN] OBSERVATION
The `log-structured-hash-table` Bitcask variant stores only `(filepath, offset)` in its index, omitting record size and timestamp, requiring an extra header parse on every `get()` to discover record boundaries
- Source: entries/2026/05/29/topic-bitcask-paper-design.md

### log-structured-sentinel-is-in-band [IN] OBSERVATION
`log-structured-hash-table/bitcask.py` uses `TOMBSTONE = b"__BITCASK_TOMBSTONE__"` as an in-band sentinel in the value space; storing those exact bytes as a value causes silent data loss on recovery when the record is misinterpreted as a deletion
- Source: entries/2026/05/29/topic-tombstone-semantics.md

### lookup-cost-is-logarithmic-in-total-vnodes [IN] OBSERVATION
`get_node` performs a single `bisect.bisect` over `_ring_positions`, so key lookup is O(log(N×V)) regardless of cluster size.
- Source: entries/2026/05/29/topic-virtual-node-count-tuning.md

### lsht-get-no-tombstone-guard [IN] OBSERVATION
`BitcaskStore.get()` in `log-structured-hash-table/bitcask.py` returns raw payload bytes without checking for the `TOMBSTONE` sentinel, so an index entry pointing to a tombstone record surfaces `b"__BITCASK_TOMBSTONE__"` as a value
- Source: entries/2026/05/29/topic-tombstone-handling-in-hint-files.md

### lsht-hint-no-tombstone-flag [IN] OBSERVATION
The `log-structured-hash-table` hint entry format (`!II` = key_size + offset) has no field to distinguish live records from tombstones, making `_load_hint_file` structurally unable to replicate `_scan_segment`'s deletion logic
- Source: entries/2026/05/29/topic-tombstone-handling-in-hint-files.md

### lsm-and-sstable-have-no-checksums [IN] OBSERVATION
Neither `log-structured-merge-tree/lsm.py` nor `sstable-and-compaction/sstable.py` compute or verify any checksums; a single bit-flip in a length-prefix field causes cascading misframing of all subsequent records
- Source: entries/2026/05/29/topic-crc-coverage-audit.md

### lsm-compact-no-atomic-rename [IN] OBSERVATION
The LSM tree's `compact()` method does not use `os.rename` or `os.replace`; grep for atomic rename operations returns zero matches in the LSM module, meaning the compaction output is written directly to its final path with no atomic swap.
- Source: entries/2026/05/29/topic-crash-recovery-testing.md

### lsm-compact-removes-tombstones-safely [IN] OBSERVATION
The LSM Tree's `compact()` method removes tombstones because it performs a full merge of all SSTables, guaranteeing no surviving SSTable can contain a superseded live value
- Source: entries/2026/05/28/topic-tombstone-gc-safety.md

### lsm-compaction-blocks-callers [IN] OBSERVATION
`LSMTree.compact()` runs synchronously in the caller's thread, meaning a `put()` or `delete()` that triggers the SSTable count threshold will block until the full k-way merge completes.
- Source: entries/2026/05/29/topic-bitcask-merge-window-scheduling.md

### lsm-compaction-deletes-without-fsync-barrier [IN] OBSERVATION
The LSM tree's `compact()` calls `os.remove(sst.path)` without an explicit `fsync` on the new SSTable or its parent directory beforehand, meaning the new file's data may not be durable when the old file is unlinked
- Source: entries/2026/05/28/topic-bitcask-crash-recovery.md

### lsm-compaction-duplicates-safe [IN] OBSERVATION
A crash during LSM compaction produces duplicate entries (old and merged SSTables both present) rather than data loss, because newer SSTables take read precedence.
- Source: entries/2026/05/28/topic-bitcask-compaction-atomicity.md

### lsm-compaction-is-full-merge [IN] OBSERVATION
`compact()` merges all SSTables into a single new SSTable (size-tiered, single-level), not incremental or leveled — simpler but with higher space amplification.
- Source: entries/2026/05/28/log-structured-merge-tree-lsm.md

### lsm-compaction-last-writer-wins [IN] OBSERVATION
Both LSM implementations resolve key conflicts during compaction by keeping only the newest value; older values are unconditionally discarded with no user-defined merge logic
- Source: entries/2026/05/29/topic-rocksdb-merge-operator.md

### lsm-crash-recovery-impossible-without-manifest [IN] OBSERVATION
LSM crash recovery is structurally impossible: the set of live SSTable files exists only in memory with no persistent manifest, so a crash permanently loses the SSTable list, and the WAL that could help reconstruct state has no integrity checking to validate its own records during replay.

### lsm-crash-test-ignores-compaction [IN] OBSERVATION
`test_crash_recovery` in `test_lsm.py` only covers WAL replay for unflushed memtable entries; no test exercises crashes during SSTable compaction, leaving the most dangerous data-loss window (mid-compaction file deletion) untested.
- Source: entries/2026/05/29/topic-crash-recovery-testing.md

### lsm-flat-compaction-always-bottommost [IN] OBSERVATION
`LSMTree.compact()` in `lsm.py` merges all SSTables into one without level hierarchy, making every compaction implicitly a bottommost-level operation where tombstone removal is always safe
- Source: entries/2026/05/29/topic-rocksdb-bottommost-compaction.md

### lsm-flush-double-loss-window [IN] OBSERVATION
The LSM `_flush()` truncates the WAL after writing an SSTable, but since neither the SSTable write nor the truncation is fsynced, a crash can lose both the WAL source data and the SSTable destination simultaneously — making the data irrecoverable
- Source: entries/2026/05/29/topic-bitcask-crash-recovery-guarantees.md

### lsm-forward-only-iteration [IN] OBSERVATION
SSTable iterators (`scan`, `scan_all`) support only forward iteration; there is no `Prev()` or reverse scan capability, preventing `ORDER BY DESC` or backward cursor pagination
- Source: entries/2026/05/29/topic-leveldb-merging-iterator.md

### lsm-get-probes-all-sstables-on-miss [IN] OBSERVATION
`LSMTree.get()` iterates through every SSTable in reverse sequence order and returns only after checking all of them when a key is absent; without bloom filters, missing-key lookups are O(N) in SSTable count
- Source: entries/2026/05/29/topic-bloom-filters-for-read-optimization.md

### lsm-heapq-imported-unused [IN] OBSERVATION
`heapq` is imported but never referenced in the code; `compact()` uses list sorting (`sort by (key, -seq)`) instead of a heap-based k-way merge.
- Source: entries/2026/05/28/log-structured-merge-tree-lsm.md

### lsm-memtable-swap-is-reference-not-copy [IN] OBSERVATION
`_flush` line 307 `frozen = self._memtable` captures a Python reference to the SortedDict, not a deep copy, so concurrent mutation of the dict after the swap would be unsafe without synchronization
- Source: entries/2026/05/29/topic-concurrent-flush-safety.md

### lsm-memtable-threshold-triggers-flush [IN] OBSERVATION
When the number of entries in the active memtable reaches `memtable_threshold`, the memtable is automatically frozen and flushed to a new SSTable on disk
- Source: entries/2026/05/28/log-structured-merge-tree-test_lsm.md

### lsm-miss-probes-all-due-to-no-bloom-integration [IN] OBSERVATION
The LSM tree scans every SSTable on negative lookups because the Bloom filter module — which is correctly implemented with textbook-optimal sizing — is never wired into the read path.

### lsm-newest-first-read-path [IN] OBSERVATION
`get()` searches memtable, then immutable memtables, then SSTables in reverse-seq order (newest first), returning the first match; this guarantees newer writes shadow older ones without scanning all levels.
- Source: entries/2026/05/28/log-structured-merge-tree-lsm.md

### lsm-no-concurrency-control [IN] OBSERVATION
The LSM tree in `lsm.py` has no locking, reference counting, or Version snapshots; `compact()` mutates `self._sstables` in place, so concurrent readers could see inconsistent state or reference deleted SSTables
- Source: entries/2026/05/29/topic-mvcc-version-snapshots.md

### lsm-no-manifest [IN] OBSERVATION
The LSM tree has no persistent metadata log (MANIFEST); the set of live SSTable files exists only in memory (`self._sstables`) and is not crash-recoverable independently of directory scanning
- Source: entries/2026/05/28/topic-manifest-based-compaction.md

### lsm-no-synchronization [IN] OBSERVATION
`LSMTree` in `lsm.py` has zero locking, atomic swaps, or synchronization primitives; all shared state (`_memtable`, `_sstables`) is mutated in-place without protection
- Source: entries/2026/05/28/topic-lsm-concurrency-safety.md

### lsm-no-write-counters [IN] OBSERVATION
The LSM-tree engine has no built-in byte or operation counters for writes; all write instrumentation (WAL appends, SSTable flushes, compaction output) must be added from scratch
- Source: entries/2026/05/29/topic-write-amplification-measurement.md

### lsm-range-scan-materializes-all [IN] OBSERVATION
`lsm.py:range_scan` loads all matching entries from every SSTable and memtable into a dict before returning any results, making first-result latency proportional to total result set size
- Source: entries/2026/05/28/topic-merge-iterator-vs-dict-merge.md

### lsm-replay-strips-seq-nums [IN] OBSERVATION
The LSM tree's `replay` method (`lsm.py:28`) returns `List[Tuple[str, bytes]]`, discarding WAL sequence numbers and making gap detection impossible at the LSM layer
- Source: entries/2026/05/29/topic-gap-detection-during-replay.md

### lsm-requires-wal-hash-does-not [IN] OBSERVATION
The LSM tree uses a separate WAL (`lsm.py:14-64`) for crash recovery because its memtable is volatile; the hash index writes directly to the append-only data file, which serves as its own recovery log and needs no WAL.
- Source: entries/2026/05/29/topic-ddia-chapter3-hash-indexes.md

### lsm-sparse-index-default-16 [IN] OBSERVATION
The SSTable sparse index samples every 16th key by default; a point lookup binary-searches the sparse index then linear-scans up to 16 entries within the candidate block.
- Source: entries/2026/05/28/log-structured-merge-tree-lsm.md

### lsm-sstables-unprotected-mutation [IN] OBSERVATION
`LSMTree._sstables` is mutated by both `_flush()` (append) and `compact()` (full replacement) with no synchronization, versioning, or ref counting, making concurrent reads unsafe if threading or async is added
- Source: entries/2026/05/29/topic-superversion-refcount-implementation.md

### lsm-tests-single-threaded-only [IN] OBSERVATION
The LSM test suite (`test_lsm.py`) runs all operations sequentially with no concurrency, so race conditions between `_flush()`, `compact()`, and `range_scan()` are never exercised
- Source: entries/2026/05/28/topic-lsm-concurrency-safety.md

### lsm-tombstone-is-empty-bytes [IN] OBSERVATION
Deletion is represented as `TOMBSTONE = b""` (empty bytes), which means empty-byte-string values and deleted keys are indistinguishable at the storage layer.
- Source: entries/2026/05/28/log-structured-merge-tree-lsm.md

### lsm-tree-has-no-level-concept [IN] OBSERVATION
`LSMTree` in lsm.py maintains a flat list of SSTables ordered by sequence number with no level assignment; compaction merges all files into one rather than promoting between levels
- Source: entries/2026/05/29/topic-leveldb-version-set.md

### lsm-two-merge-strategies [IN] OBSERVATION
Compaction uses `heapq`-based k-way merge with `prev_key` deduplication (lsm.py ~line 323), while `range_scan` uses a dict-based materialize-everything approach (lines 275–298) — two fundamentally different merge strategies in the same codebase
- Source: entries/2026/05/29/topic-leveldb-merging-iterator.md

### lsm-uses-sortedcontainers [IN] OBSERVATION
The LSM memtable uses `sortedcontainers.SortedDict` — the only external dependency across the four storage engine modules examined so far.
- Source: entries/2026/05/28/log-structured-merge-tree-lsm.md

### lsm-wal-before-memtable [IN] OBSERVATION
Every `put()` and `delete()` writes to the WAL before inserting into the memtable, ensuring no acknowledged write is lost on crash.
- Source: entries/2026/05/28/log-structured-merge-tree-lsm.md

### lsm-wal-crash-tolerant-replay [IN] OBSERVATION
LSM WAL `replay()` silently discards any trailing partial record by checking remaining bytes at every parse step, making it tolerant of process crashes during `append()`
- Source: entries/2026/05/29/log-structured-merge-tree-lsm-WAL.md

### lsm-wal-entry-size [IN] OBSERVATION
Each LSM WAL entry is exactly `8 + len(key_utf8) + len(value)` bytes: two 4-byte big-endian length headers plus the raw payloads, with no checksum or operation-type field
- Source: entries/2026/05/29/topic-write-amplification-measurement.md

### lsm-wal-has-no-checksums [IN] OBSERVATION
The LSM tree's WAL (`lsm.py:13-63`) uses length-prefixed records with no CRC, relying solely on short-read detection for crash recovery; distinct from the separate `lsm-wal-has-no-fsync` issue
- Source: entries/2026/05/29/topic-wal-corruption-models.md

### lsm-wal-has-no-commit-markers [IN] OBSERVATION
The LSM WAL (`lsm.py:13-65`) appends bare key-value pairs with no commit or transaction boundaries, safe only because structural changes (compaction, SSTable creation) use atomic file operations outside the WAL
- Source: entries/2026/05/29/topic-wal-operation-boundaries.md

### lsm-wal-has-no-fsync [IN] OBSERVATION
The WAL class in `log-structured-merge-tree/lsm.py` calls `flush()` but never `os.fsync()`, making it strictly weaker than the standalone WAL which offers configurable sync modes
- Source: entries/2026/05/28/topic-directory-fsync-semantics.md

### lsm-wal-has-no-integrity-check [IN] OBSERVATION
The WAL in `log-structured-merge-tree/lsm.py` uses only length-prefix framing with no CRC or checksum; a single bit flip in the length field causes silent data corruption or misaligned reads
- Source: entries/2026/05/29/topic-block-aligned-wal-records.md

### lsm-wal-no-checksums [IN] OBSERVATION
The LSM WAL format has no checksums or magic bytes; it can detect truncation (incomplete length-prefixed records) but not corruption of existing bytes
- Source: entries/2026/05/29/log-structured-merge-tree-lsm-WAL.md

### lsm-wal-no-crc [IN] OBSERVATION
The LSM tree's WAL (`lsm.py:14-53`) uses only length-prefixed framing with no checksum, making it unable to distinguish corruption from valid data unless a length field causes an out-of-bounds read
- Source: entries/2026/05/29/topic-record-boundary-recovery.md

### lsm-wal-replays-on-reopen [IN] OBSERVATION
Constructing a new `LSMTree` on an existing directory replays the WAL to recover unflushed memtable state; this is validated by the crash recovery test which skips `close()` and reopens
- Source: entries/2026/05/28/log-structured-merge-tree-test_lsm.md

### lsm-wal-strictly-weaker-than-standalone-wal [IN] OBSERVATION
The LSM tree's built-in WAL provides strictly weaker guarantees than the standalone WAL module along two independent axes: it has no CRC or checksum for integrity (vs per-record CRC32 in the standalone WAL), and its truncation zeroes the entire file instantly via wb mode (vs careful record-by-record rewriting), meaning corruption goes undetected and truncation is all-or-nothing rather than surgical.

### lsm-wal-truncate-destroys-immediately [IN] OBSERVATION
The LSM tree's `WAL.truncate()` opens the file with `"wb"` mode which zeroes all content instantly, with no intermediate durable state to recover from if a crash occurs before the file is reopened for appending
- Source: entries/2026/05/29/topic-crash-safety-of-truncate.md

### lsm-wal-truncate-no-arg [IN] OBSERVATION
The LSM tree at `lsm.py:314` calls `self._wal.truncate()` with no sequence argument, indicating it uses a different WAL interface than the standalone `write-ahead-log/wal.py` module
- Source: entries/2026/05/29/topic-log-structured-checkpoint-coordination.md

### lsm-wal-truncate-no-error-recovery [IN] OBSERVATION
If any `open()` or `close()` call within `WAL.truncate()` fails, the exception propagates uncaught and can leave `self._fd` holding a closed handle, breaking subsequent `append()` calls
- Source: entries/2026/05/29/log-structured-merge-tree-lsm-truncate.md

### lsm-wal-truncate-wb-then-ab [IN] OBSERVATION
`WAL.truncate()` uses a two-open sequence — open in `"wb"` mode to truncate the file, then reopen in `"ab"` mode for appending — because `"wb"` positions the cursor at offset 0 which is unsafe for a WAL
- Source: entries/2026/05/29/log-structured-merge-tree-lsm-truncate.md

### lsm-wal-truncated-after-sstable-write [IN] OBSERVATION
The WAL is truncated only after a successful SSTable write; if a crash occurs between SSTable.write and WAL.truncate, replay harmlessly re-inserts into the memtable (idempotent because it's a dict).
- Source: entries/2026/05/28/log-structured-merge-tree-lsm.md

### lww-auto-increments-timestamp [IN] OBSERVATION
Sequential `LWWRegister.set()` calls without explicit timestamps produce monotonically increasing timestamps (auto-incremented), so ordering is preserved even without caller-supplied clocks.
- Source: entries/2026/05/29/conflict-free-replicated-data-types-test_crdts.md

### lww-register-tiebreaks-on-replica-id [IN] OBSERVATION
When two `LWWRegister` writes have identical timestamps, the one with the lexicographically higher `replica_id` wins, making conflict resolution deterministic but arbitrary
- Source: entries/2026/05/29/topic-delta-state-crdts.md

### lww-tiebreak-is-deterministic [IN] OBSERVATION
`LWWRegister.merge` uses `(timestamp, writer_id)` tuple comparison, making conflict resolution a total order with no ambiguity since replica IDs are unique
- Source: entries/2026/05/29/conflict-free-replicated-data-types-crdts.md

### macos-fsync-incomplete-without-fullfsync [IN] OBSERVATION
On macOS/Darwin, `os.fsync()` does not guarantee flushing the disk write cache to stable storage; only `fcntl(fd, F_FULLFSYNC)` provides that guarantee, meaning all 13 fsync sites in the codebase may provide no power-loss durability on the development platform
- Source: entries/2026/05/29/topic-sqlite-durability-model.md

### macos-fsync-not-durable [IN] OBSERVATION
On the development platform (Darwin/APFS), `os.fsync()` may not flush the disk write cache; true durability requires `fcntl(fd, F_FULLFSYNC)`, which is never used at any of the 13 fsync call sites in the codebase.
- Source: entries/2026/05/29/topic-posix-durability-guarantees.md

### macos-fsync-not-durable-without-fullfsync [IN] OBSERVATION
On macOS (the development platform), `os.fsync()` may not flush the disk's hardware write cache; true power-loss durability requires `fcntl(fd, F_FULLFSYNC)` which is never used in the codebase.
- Source: entries/2026/05/29/topic-fsync-ordering-guarantees.md

### macos-fsync-weaker-than-implied [IN] OBSERVATION
The codebase uses `os.fsync()` exclusively, which on macOS/APFS does not flush the disk write cache; true durability requires `fcntl(fd, F_FULLFSYNC)` which is never used anywhere
- Source: entries/2026/05/29/topic-shadow-paging-or-steal-no-force.md

### map-side-join-cartesian-on-duplicates [IN] OBSERVATION
When multiple records share the same join key value, all three strategies produce the cartesian product of the matching left and right groups
- Source: entries/2026/05/29/map-side-join-map_side_joins.md

### map-side-join-conflict-prefix [IN] OBSERVATION
When left and right datasets share a non-key field name, `_merge_records` disambiguates with `left_` and `right_` prefixes; this convention applies in all join types including `None`-fill paths for left joins
- Source: entries/2026/05/29/map-side-join-map_side_joins.md

### map-side-join-missing-key-skip [IN] OBSERVATION
Records missing the join key are silently dropped and counted in `stats["skipped_records"]` rather than raising exceptions, across all three join strategies
- Source: entries/2026/05/29/map-side-join-map_side_joins.md

### map-side-join-no-real-parallelism [IN] OBSERVATION
Mapper parallelism is simulated via round-robin chunking and `_mapper_id` tagging on output records; no threads or processes are used
- Source: entries/2026/05/29/map-side-join-map_side_joins.md

### map-side-join-three-strategies [IN] OBSERVATION
The module implements exactly three join strategies (broadcast hash, partitioned hash, sort-merge) that all produce identical inner-join results when verified by `compare_join_strategies`
- Source: entries/2026/05/29/map-side-join-map_side_joins.md

### map-side-joins-track-mapper-id-on-output [IN] OBSERVATION
All three map-side join strategies attach a `_mapper_id` field to each output record, enabling verification that partitions/mappers operated independently
- Source: entries/2026/05/29/topic-ddia-ch10-reduce-side-joins.md

### mapreduce-combiner-transparent [IN] OBSERVATION
Applying a combiner never changes final results; it only reduces intermediate record volume — the combiner must be semantically invisible
- Source: entries/2026/05/29/mapreduce-framework-test_mapreduce.md

### mapreduce-materializes-at-phase-boundaries [IN] OBSERVATION
MapReduce in `mapreduce.py` writes intermediate data to JSON partition files between map and reduce phases, creating a full materialization barrier that prevents streaming from mapper to reducer
- Source: entries/2026/05/29/topic-ddia-chapter-10-batch-processing.md

### mapreduce-results-deterministic [IN] OBSERVATION
`MapReduceJob.run()` produces identical output regardless of `num_mappers`/`num_reducers` configuration for the same input data — the shuffle/partition step is deterministic
- Source: entries/2026/05/29/mapreduce-framework-test_mapreduce.md

### mapreduce-run-accepts-filepath [IN] OBSERVATION
`MapReduceJob.run()` accepts either a list of `(key, value)` tuples or a file path string as input, with the file path handled transparently
- Source: entries/2026/05/29/mapreduce-framework-test_mapreduce.md

### mapreduce-stats-tracks-records [IN] OBSERVATION
`job.stats` tracks `map_input_records`, `map_output_records`, `reduce_output_records`, worker counts, and elapsed time after each run
- Source: entries/2026/05/29/mapreduce-framework-test_mapreduce.md

### mapreduce-strict-mode-default [IN] OBSERVATION
`MapReduceJob` defaults to strict mode where mapper/reducer exceptions propagate to the caller; `fault_tolerant=True` must be explicitly set to enable silent error skipping
- Source: entries/2026/05/29/mapreduce-framework-test_mapreduce.md

### materialization-barriers-span-storage-and-processing [IN] OBSERVATION
Unbounded in-memory materialization is a cross-cutting concern from storage through batch processing: storage operations (LSM range scans, compaction, SSTable index rebuilds) materialize entire datasets, and pipeline stages (Count) silently accumulate all input before producing output, meaning the materialization problem exists at every layer of the data processing stack.

### max-keys-prevents-overflow [IN] OBSERVATION
The `max_keys_per_page` parameter bounds entries per node so that serialized output fits within `page_size`, with splits triggered before the limit is exceeded
- Source: entries/2026/05/29/topic-page-overflow-and-size-limits.md

### maybe-rotate-requires-lock [IN] OBSERVATION
`_maybe_rotate` must be called under `self._lock` but does not acquire or assert the lock itself; all three call sites (`append`, `append_batch`, `checkpoint`) satisfy this obligation.
- Source: entries/2026/05/29/write-ahead-log-wal-_maybe_rotate.md

### membership-correctness-masks-hint-data-loss [IN] OBSERVATION
The gossip membership lifecycle correctly detects and removes crashed hint-holding nodes, but this membership-layer correctness masks data-layer failure: hints stored on the crashed node are permanently lost with no recovery mechanism or error indication, and the cluster reports the failure as handled because the membership event was processed successfully.

### membership-detection-unreliable-in-accuracy-and-propagation [IN] OBSERVATION
The membership subsystem is unreliable at two independent levels that interact destructively: failure detection is both the single correctness bottleneck AND permanently miscalibrated (no arrival-rate history for adaptive thresholds), while membership and data convergence operate at fundamentally different rates (O(log N) membership gossip vs O(N) data propagation in ring topology), meaning incorrect liveness decisions propagate through membership faster than data can adjust to them.

### memtable-threshold-bounds-recovery [IN] OBSERVATION
The `memtable_threshold` parameter (`lsm.py:202`) directly caps the maximum number of WAL entries that must be replayed on crash recovery, because the WAL is truncated on every flush.
- Source: entries/2026/05/29/topic-wal-checkpoint-protocol.md

### merge-does-not-delete-inputs [IN] OBSERVATION
`merge_sstables` never deletes or modifies input SSTable files; cleanup is the caller's responsibility, and the current `CompactionManager` callers also do not delete old files from disk.
- Source: entries/2026/05/28/sstable-and-compaction-sstable-merge_sstables.md

### merge-mutates-self-and-returns-self [IN] OBSERVATION
Every CRDT `merge()` method mutates the receiver in place and returns `self`, enabling chaining but meaning callers must `deepcopy()` before merging if the original state must be preserved
- Source: entries/2026/05/29/topic-delta-state-crdts.md

### merge-no-atomic-write [IN] OBSERVATION
`merge_sstables` writes directly to `output_path` with no temp-file-and-rename pattern, so a crash mid-merge leaves a corrupt partial file at the target path.
- Source: entries/2026/05/28/sstable-and-compaction-sstable-merge_sstables.md

### merge-preserves-tombstones-by-default [IN] OBSERVATION
`merge_sstables` writes tombstones (`value=None`) to the output unless `remove_tombstones=True` is explicitly passed, and no current call site in `CompactionManager` enables removal.
- Source: entries/2026/05/28/sstable-and-compaction-sstable-merge_sstables.md

### merge-sstables-tombstone-flag-is-caller-controlled [IN] OBSERVATION
`merge_sstables()` accepts a `remove_tombstones` boolean from the caller rather than computing safety from level metadata, so correctness depends entirely on the caller passing the right value
- Source: entries/2026/05/29/topic-leveled-compaction-tombstone-policy.md

### merkle-diff-asymmetric-leaf-counts [IN] OBSERVATION
Two trees with different `_leaf_count` but the same `_padded_size` can be diffed; positions where one tree has data and the other has `EMPTY_HASH` padding are reported as diffs
- Source: entries/2026/05/29/merkle-tree-merkle_tree-diff.md

### merkle-diff-enables-efficient-sync [IN] OBSERVATION
`MerkleTree.diff` compares two trees in O(log n) by short-circuiting at matching subtree hashes, providing the divergence-detection layer that would feed `_receive_replica` with only changed keys rather than a full key scan
- Source: entries/2026/05/29/topic-dynamo-anti-entropy.md

### merkle-diff-enables-targeted-key-range-repair [IN] OBSERVATION
Merkle tree diffs produce sorted leaf indices that map directly to key ranges when the tree is built with key-sorted input, enabling anti-entropy repair that transfers only divergent key ranges rather than full dataset comparisons.

### merkle-diff-filters-padding [IN] OBSERVATION
`diff()` excludes padding indices beyond `max(self._leaf_count, other._leaf_count)`, using `max` (not `min`) so positions where one tree has data and the other has `EMPTY_HASH` padding are still reported as diffs
- Source: entries/2026/05/29/merkle-tree-merkle_tree-diff.md

### merkle-diff-is-pure [IN] OBSERVATION
`diff()` and `_diff_recursive` are read-only — they mutate neither tree, allocate only a local result list, and produce no side effects
- Source: entries/2026/05/29/merkle-tree-merkle_tree-diff.md

### merkle-diff-order-is-ascending [IN] OBSERVATION
Differing leaf indices returned by `diff()` are in ascending order as a consequence of left-to-right DFS traversal in `_diff_recursive`
- Source: entries/2026/05/29/merkle-tree-merkle_tree-diff.md

### merkle-diff-prunes-matching-subtrees [IN] OBSERVATION
`MerkleTree._diff_recursive()` returns immediately when subtree hashes match, making the diff cost proportional to the number of divergent keys rather than total tree size — the logarithmic efficiency that makes anti-entropy practical.
- Source: entries/2026/05/29/topic-anti-entropy-vs-read-repair.md

### merkle-diff-returns-leaf-indices [IN] OBSERVATION
`MerkleTree.diff()` returns a list of integer leaf indices where the two trees diverge, not subtree nodes or hash pairs
- Source: entries/2026/05/29/merkle-tree-test_merkle.md

### merkle-get-proof-rejects-padding-indices [IN] OBSERVATION
`get_proof` bounds-checks against `_leaf_count` (real leaves), not `_padded_size`, so proofs cannot be generated for padding slots even though they have valid hashes in the backing array
- Source: entries/2026/05/29/merkle-tree-merkle_tree-get_proof.md

### merkle-hashes-over-hex-not-bytes [IN] OBSERVATION
Internal node hashes are computed over concatenated 64-char hex digest strings (128 ASCII bytes) rather than raw 32-byte SHA-256 digests, resulting in 4x more data per hash input.
- Source: entries/2026/05/29/merkle-tree-merkle_tree-verify_proof.md

### merkle-proof-direction-is-sibling-position [IN] OBSERVATION
The `direction` field in each proof sibling tuple indicates where the **sibling** sits (left or right), not the current node — this controls hash concatenation order during verification
- Source: entries/2026/05/29/merkle-tree-merkle_tree-get_proof.md

### merkle-proof-is-snapshot [IN] OBSERVATION
Merkle proofs are point-in-time snapshots; a proof generated before `update_leaf` will fail verification against the new root hash, and nothing invalidates the stale proof in-place
- Source: entries/2026/05/29/merkle-tree-merkle_tree-get_proof.md

### merkle-proof-length-equals-height [IN] OBSERVATION
A proof for a tree with `_padded_size = 2^h` contains exactly `h` sibling entries, one per tree level excluding the root
- Source: entries/2026/05/29/merkle-tree-merkle_tree-get_proof.md

### merkle-proof-sibling-order-bottom-up [IN] OBSERVATION
`get_proof` returns siblings ordered from leaf level to root level (bottom-up), and `verify_proof` consumes them in that same order
- Source: entries/2026/05/29/merkle-tree-merkle_tree-get_proof.md

### merkle-proof-size-is-log-n [IN] OBSERVATION
A Merkle proof contains exactly `height` sibling hashes where height = log2(next_power_of_2(leaf_count)), making both proof size and verification cost O(log N).
- Source: entries/2026/05/29/topic-merkle-proof-security-model.md

### merkle-tree-array-indexing [IN] OBSERVATION
The tree uses 0-indexed implicit array layout where children of node `i` are at `2i+1` and `2i+2`, and leaves start at index `padded_size - 1`
- Source: entries/2026/05/29/merkle-tree-merkle_tree.md

### merkle-tree-hashes-raw-bytes [IN] OBSERVATION
The Merkle tree implementation avoids serialization canonicalization entirely by accepting caller-provided `bytes` and hashing them directly via `hashlib.sha256(data)`, making it immune to the JSON encoding ambiguity that affects the PBFT digest.
- Source: entries/2026/05/29/topic-json-canonicalization-risks.md

### merkle-tree-height-formula [IN] OBSERVATION
A `MerkleTree` with 4 leaves has `height == 2`, indicating height counts internal levels (log₂ of padded leaf count)
- Source: entries/2026/05/29/merkle-tree-test_merkle.md

### merkle-tree-supports-non-power-of-two [IN] OBSERVATION
The implementation handles non-power-of-2 leaf counts (tested with 1 and 3 leaves) via internal padding to the next power of 2 using `EMPTY_HASH`
- Source: entries/2026/05/29/merkle-tree-test_merkle.md

### merkle-tree-uses-sha256-for-cross-replica-verification [IN] OBSERVATION
The Merkle tree uses SHA-256 (`merkle_tree.py:11`) for tamper-evident cross-replica verification via `verify_proof`, while all storage engines use CRC32 — reflecting the split between accidental-corruption detection on trusted local disk and adversarial-integrity across trust boundaries
- Source: entries/2026/05/29/topic-crc32-vs-cryptographic-hash.md

### meta-page-fixed-20-byte-payload [IN] OBSERVATION
The metadata payload is exactly 20 bytes (5 big-endian uint32s packed as `'>5I'`), zero-padded to `page_size`; the format requires `page_size >= 20` but this is not enforced.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-_write_meta.md

### meta-page-is-page-zero [IN] OBSERVATION
The B-tree metadata (root, height, total_keys, next_free, free_head) is always stored at file offset 0 in page 0, which is reserved as the superblock and occupies exactly `page_size` bytes.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-_write_meta.md

### more-vnodes-monotonically-improves-balance [IN] OBSERVATION
The test `test_vnode_count_affects_balance` asserts that imbalance at 500 vnodes is strictly less than at 1 vnode, and the statistical model (variance ∝ 1/V) guarantees monotonic improvement.
- Source: entries/2026/05/29/topic-virtual-node-count-tuning.md

### mr-combiner-is-map-side-only [IN] OBSERVATION
The combiner runs inside `_run_mapper` on each mapper's output for a single partition independently; it never sees data from other mappers or other partitions.
- Source: entries/2026/05/29/mapreduce-framework-mapreduce.md

### mr-fault-tolerant-silently-drops [IN] OBSERVATION
When `fault_tolerant=True`, mapper and reducer exceptions are caught and silently skipped with no logging or error reporting, producing partial results without any indication of data loss.
- Source: entries/2026/05/29/mapreduce-framework-mapreduce.md

### mr-groupby-requires-presort [IN] OBSERVATION
`_run_reducer` sorts all intermediate pairs by key before calling `itertools.groupby`, which is required because `groupby` only groups consecutive elements with the same key.
- Source: entries/2026/05/29/mapreduce-framework-mapreduce.md

### mr-hash-partitions-keys [IN] OBSERVATION
MapReduceJob assigns intermediate keys to reducer partitions using `hash(key) % num_reducers`, ensuring all values for a given key reach the same reducer.
- Source: entries/2026/05/29/mapreduce-framework-mapreduce.md

### mr-intermediate-data-uses-json-files [IN] OBSERVATION
Intermediate shuffle data is serialized as JSON files in a temp directory with naming convention `map-{mapper_id}-part-{partition}.json`.
- Source: entries/2026/05/29/mapreduce-framework-mapreduce.md

### multi-leader-conflict-log-records-strategy [IN] OBSERVATION
Conflicts are recorded in `conflict_log` with metadata including the key and the `ConflictStrategy` enum value that resolved them.
- Source: entries/2026/05/29/multi-leader-replication-test_multi_leader.md

### multi-leader-conflict-record-captures-both-values [IN] OBSERVATION
`ConflictRecord` stores both `local_value` and `remote_value` along with the key and resolution strategy, providing a complete audit trail for every conflict resolution.
- Source: entries/2026/05/29/multi-leader-replication-tester_test_multi_leader.md

### multi-leader-convergence-skips-origin [IN] OBSERVATION
`all_converged()` compares `(value, timestamp, is_tombstone)` but intentionally ignores `origin_node_id` (tuple index 2), since different arrival orders can record different origins for the same resolved state
- Source: entries/2026/05/29/multi-leader-replication-multi_leader-MultiLeaderCluster.md

### multi-leader-custom-merge-idempotent-across-syncs [IN] OBSERVATION
Custom merge functions are applied once per conflict; repeated sync rounds on already-converged values must not re-trigger the merge, preventing bugs like counters doubling on every sync cycle.
- Source: entries/2026/05/29/multi-leader-replication-tester_test_multi_leader.md

### multi-leader-custom-merge-new-timestamp [IN] OBSERVATION
Custom merge resolution creates a new timestamp (`max(local_ts, remote_ts) + 1`) and canonical origin so the merged result supersedes both conflicting inputs in any future LWW comparison
- Source: entries/2026/05/29/multi-leader-replication-multi_leader.md

### multi-leader-custom-merge-requires-merge-fn [IN] OBSERVATION
Constructing a cluster with `CUSTOM_MERGE` strategy and `merge_fn=None` is accepted silently, but raises `TypeError` at the first actual conflict during `sync()`
- Source: entries/2026/05/29/multi-leader-replication-multi_leader-MultiLeaderCluster.md

### multi-leader-idempotent-apply [IN] OBSERVATION
Each `(key, timestamp, origin_node)` triple is applied at most once per node, tracked by the `_seen` set, preventing duplicate application in ring topologies where changes propagate through multiple hops
- Source: entries/2026/05/29/multi-leader-replication-multi_leader.md

### multi-leader-lamport-monotonic [IN] OBSERVATION
Successive `put()` calls on the same `ReplicaNode` yield strictly increasing Lamport timestamps.
- Source: entries/2026/05/29/multi-leader-replication-test_multi_leader.md

### multi-leader-lww-deterministic-tiebreak [IN] OBSERVATION
LWW conflict resolution compares `(timestamp, node_id)` tuples using Python's lexicographic tuple comparison, guaranteeing all nodes independently reach the same winner without coordination
- Source: entries/2026/05/29/multi-leader-replication-multi_leader.md

### multi-leader-no-tombstone-gc [IN] OBSERVATION
Multi-leader replication stores tombstones indefinitely with no compaction or garbage collection method, which is safe but accumulates dead entries without bound
- Source: entries/2026/05/29/topic-sstable-tombstone-gc-safety.md

### multi-leader-ring-convergence-rounds [IN] OBSERVATION
RING topology requires at least N−1 `sync()` rounds to fully propagate a single-source change across N nodes, because each round advances the change by exactly one hop
- Source: entries/2026/05/29/multi-leader-replication-multi_leader-MultiLeaderCluster.md

### multi-leader-ring-needs-multiple-rounds [IN] OBSERVATION
Ring topology requires at least 2 sync rounds to propagate a write across 3+ nodes, unlike all-to-all topology which converges in a single round; `sync_until_converged()` loops to handle this.
- Source: entries/2026/05/29/multi-leader-replication-tester_test_multi_leader.md

### multi-leader-sync-collects-then-distributes [IN] OBSERVATION
`sync()` drains all pending queues from every node before distributing any changes, preventing intra-round cascading where one node's change would trigger further propagation within the same sync round
- Source: entries/2026/05/29/multi-leader-replication-multi_leader.md

### multi-leader-sync-count-includes-noops [IN] OBSERVATION
The count returned by `sync()` includes idempotent no-op deliveries (changes already in the target's `_seen` set), so it reflects replication attempts, not accepted changes
- Source: entries/2026/05/29/multi-leader-replication-multi_leader-MultiLeaderCluster.md

### multi-leader-sync-designed-for-safe-convergence [IN] OBSERVATION
Multi-leader sync is designed for safe convergence: the two-phase collect-then-distribute pattern prevents intra-round cascading, custom merge functions are required to be idempotent across repeated syncs, and merged results receive fresh timestamps ensuring they supersede both inputs.

### multi-leader-sync-is-discrete [IN] OBSERVATION
Replication occurs only when `sync()` is explicitly called; there is no background replication thread or event-driven propagation, giving tests full control over replication timing.
- Source: entries/2026/05/29/multi-leader-replication-tester_test_multi_leader.md

### multi-leader-tombstone-delete [IN] OBSERVATION
Deletes are implemented as tombstone writes (`_TOMBSTONE` sentinel with `is_tombstone=True`) that replicate like normal mutations; `get()` returns `None` for tombstoned keys
- Source: entries/2026/05/29/multi-leader-replication-multi_leader.md

### multiple-core-components-assume-single-thread-silently [IN] OBSERVATION
The B-tree, garbage collector, and consistent hash ring all assume single-threaded access without locks, assertions, or documentation — a systematic pattern where concurrency unsafety is implicit rather than enforced.

### mvcc-active-set-frozen-at-begin [IN] OBSERVATION
`MVCCDatabase.begin_transaction` captures `active_at_start` as a frozen set of all currently active transaction IDs at that moment; it is never modified after creation and serves as the transaction's immutable snapshot boundary
- Source: entries/2026/05/29/topic-snapshot-vs-ssi-tradeoffs.md

### mvcc-active-tx-not-recoverable [IN] OBSERVATION
Active (uncommitted) transactions cannot survive a process restart; the safe recovery strategy is to treat them as aborted, which is consistent with `_is_visible`'s existing handling of aborted transaction versions as invisible
- Source: entries/2026/05/29/topic-snapshot-disk-persistence.md

### mvcc-append-only-versions [IN] OBSERVATION
Writers never modify existing `Version` objects (except self-overwrites); new values are appended to per-key version lists, preserving the immutable snapshot for concurrent readers
- Source: entries/2026/05/29/topic-mvcc-snapshot-isolation.md

### mvcc-counters-must-be-monotonic [IN] OBSERVATION
`_next_tx_id` and `_next_timestamp` must never reissue a previously used value; if they reset to 1 on restart, visibility comparisons like `created_by < tx.tx_id` silently produce wrong results by conflating old and new transaction IDs
- Source: entries/2026/05/29/topic-snapshot-disk-persistence.md

### mvcc-no-persistence [IN] OBSERVATION
`MVCCDatabase` has zero serialization or persistence support; all state (`_versions`, `_transactions`, `_committed`, `_aborted`, counters) lives only in memory with no `to_dict`, `from_dict`, or file I/O of any kind
- Source: entries/2026/05/29/topic-snapshot-disk-persistence.md

### mvcc-pattern-exists-at-row-level [IN] OBSERVATION
The codebase implements MVCC visibility (`ssi_database.py:_visible_value`, `mvcc_database.py:Version`) at the row/key level but not at the SSTable-file level where LevelDB uses Version reference counting for concurrent reader isolation
- Source: entries/2026/05/29/topic-mvcc-version-snapshots.md

### mvcc-three-layer-visibility-model [IN] OBSERVATION
MVCC visibility is enforced through three complementary invariants: append-only version storage prevents lost updates, own-writes are always visible regardless of commit state, and deletions follow the same visibility rules as writes — collectively ensuring snapshot consistency.

### mvcc-txid-monotonicity-assumed [IN] OBSERVATION
The visibility check `created_by < tx.tx_id` assumes transaction IDs are monotonically increasing to mean "started before us"; non-monotonic or reused IDs would break snapshot correctness.
- Source: entries/2026/05/29/snapshot-isolation-mvcc_database-_is_visible.md

### mvcc-uncommitted-data-memory-only [IN] OBSERVATION
In MVCCDatabase, uncommitted transaction writes exist only as in-memory Version objects in `_versions` and never reach a persistent data store; abort works by adding the tx_id to `_aborted` so visibility checks filter it out
- Source: entries/2026/05/29/topic-undo-logging-and-steal-policy.md

### mvcc-visibility-depends-on-tx-metadata [IN] OBSERVATION
`_is_visible` requires `_committed`, `_aborted`, and each transaction's `active_at_start` set; persisting version chains without this metadata makes them unreadable after restore since visibility cannot be determined
- Source: entries/2026/05/29/topic-snapshot-disk-persistence.md

### naming-convention-not-uniform [IN] OBSERVATION
The tester/test split uses at least three naming patterns across directories: `tester_test_*.py`, `test_*_tester.py`, and `test_tester_*.py`
- Source: entries/2026/05/29/topic-tester-test-pattern.md

### neither-bitcask-supports-expiry [IN] OBSERVATION
Neither Bitcask implementation supports per-key TTL or expiry; `hash-index-storage` stores a timestamp in the header but never checks it for expiration, and `log-structured-hash-table` omits timestamps entirely.
- Source: entries/2026/05/29/topic-bitcask-paper-vs-implementation.md

### neither-partitioning-strategy-fully-reliable [IN] OBSERVATION
Neither partitioning strategy offers both reliable routing and ordered access: range partitioning preserves key ordering but depends on maintaining parallel arrays in lockstep with invariants that are never verified (boundaries/partitions correspondence, adjacency assumptions in merge), while hash partitioning provides deterministic single-lookup routing but permanently destroys lexicographic key order, making range queries impossible.

### network-partitions-amplify-write-correctness-gaps [IN] OBSERVATION
Network partitions create a force multiplier on distributed write correctness gaps: partitions disrupt gossip-based failure detection (the single correctness gate for replication, read repair, and hinted handoff), while writes that do proceed operate under already-weakened quorum semantics (sloppy quorums count hints as successes, sub-quorum configurations accepted with only a warning).

### next-free-page-is-monotonic [IN] OBSERVATION
The `next_free_page` metadata field only ever increases — it is the high-water mark of file extension, never decremented even when pages are freed; file size never shrinks, only interior pages are recycled via the free list.
- Source: entries/2026/05/29/topic-free-list-reuse-under-churn.md

### no-atomic-file-creation [IN] OBSERVATION
No SSTable writer or compaction routine in any implementation uses the write-temp/fsync/rename pattern; all write directly to the final file path
- Source: entries/2026/05/29/topic-write-to-temp-then-rename-pattern.md

### no-ci-or-runner-for-tester-tests [IN] OBSERVATION
No CI pipeline (no `.github/` directory), Makefile, or orchestration script exists to invoke `tester_test_*.py` files; they are run manually via `python tester_test_*.py`
- Source: entries/2026/05/29/topic-tester-runner-system.md

### no-compaction-manifest [IN] OBSERVATION
None of the three storage engines (hash-index, log-structured-hash-table, LSM) use a manifest or compaction log to make segment replacement atomic; segment discovery is purely filesystem-based via directory listing.
- Source: entries/2026/05/28/topic-bitcask-compaction-atomicity.md

### no-compaction-rate-limiting [IN] OBSERVATION
Neither LSM implementation throttles compaction I/O throughput; a large merge will saturate disk bandwidth without regard to concurrent foreground reads or writes.
- Source: entries/2026/05/29/topic-bitcask-merge-window-scheduling.md

### no-component-resyncs-after-corruption [IN] OBSERVATION
Every binary record reader in this codebase (WAL, both Bitcasks, LSM WAL, SSTable, B-tree WAL) stops at the first CRC failure, short read, or parse error; none attempt to scan forward for the next valid record boundary
- Source: entries/2026/05/29/topic-length-prefix-framing-resilience.md

### no-concurrency-primitives [IN] OBSERVATION
The entire B-tree implementation contains zero locking, latching, mutex, or thread-safety constructs; it is strictly single-threaded with no concurrent-access support
- Source: entries/2026/05/28/topic-concurrent-btree-deletion.md

### no-concurrency-protection-on-base-offsets [IN] OBSERVATION
Neither `truncate` nor `compact_partition` holds a lock; concurrent calls on the same partition would race on both `_partitions` and `_base_offsets`, assuming single-threaded execution.
- Source: entries/2026/05/29/topic-log-compaction-vs-retention.md

### no-conftest-anywhere [IN] OBSERVATION
The ddia-implementations repo contains zero `conftest.py` files at any directory level — confirmed by exhaustive grep across all 37 modules, not just root
- Source: entries/2026/05/29/topic-conftest-hierarchy.md

### no-cross-file-fsync-ordering-protocol [IN] OBSERVATION
The standalone WAL module provides no mechanism or documentation for enforcing the required WAL-before-data fsync ordering; callers must implement the `fsync(wal)` → `write(data)` → `fsync(data)` protocol themselves.
- Source: entries/2026/05/29/topic-fsync-ordering-guarantees.md

### no-cuckoo-filter-in-codebase [IN] OBSERVATION
The repository contains no cuckoo filter implementation; the counting Bloom filter is the only deletion-supporting probabilistic filter, leaving the saturation problem unsolved
- Source: entries/2026/05/29/topic-counter-overflow-and-cuckoo-filters.md

### no-cuckoo-or-xor-filter-in-repo [IN] OBSERVATION
The repository has no cuckoo filter, quotient filter, or xor filter implementation; CountingBloomFilter is the only probabilistic membership structure supporting deletion.
- Source: entries/2026/05/29/topic-cuckoo-filters.md

### no-data-path-is-trustworthy [IN] OBSERVATION
No data path through the system is trustworthy: the primary storage path through compaction is unsalvageable (crash-unsafe within a broken durability pipeline, no concurrent access protection) while the secondary derived-data path through event infrastructure is unreliable in both content semantics and ordering guarantees, leaving no channel through which data can be read or propagated with confidence.

### no-defense-in-depth-against-corruption [IN] OBSERVATION
The system has no defense-in-depth against data corruption: input validation is systematically absent at API boundaries (malformed data enters freely), and corruption is terminal across all readers with no resync capability (corrupted data can never be recovered) — the system is maximally permeable to corruption entry and maximally fragile to corruption presence.

### no-direct-page-writes [IN] OBSERVATION
The BTree class never calls `pm.write_page` directly for data pages; all data page writes are routed through `_wal_write_page` (or `_wal_write_meta` for page 0) to maintain the write-ahead invariant.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-_wal_write_page.md

### no-directory-fsync-anywhere [IN] OBSERVATION
No implementation in the codebase calls `os.fsync()` on a directory file descriptor; all fsync calls target data file descriptors only
- Source: entries/2026/05/28/topic-directory-fsync-semantics.md

### no-disk-artifact-patterns [IN] OBSERVATION
On-disk files created by storage-engine modules (B-tree pages, WAL segments, SSTable files) are not covered by `.gitignore`, relying on test teardown or manual cleanup
- Source: entries/2026/05/29/unknown.md

### no-double-free-detection [IN] OBSERVATION
Calling `free_page` twice on the same page creates a cycle in the free list, which `allocate_page` cannot detect, leading to silent data corruption through repeated page reuse.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-free_page.md

### no-fixed-block-boundaries-in-sstable [IN] OBSERVATION
Both SSTable implementations write variable-length records contiguously with no fixed-size block structure, making mid-file resynchronization after corruption impossible — a corrupted length field causes the reader to consume all subsequent valid records as garbage.
- Source: entries/2026/05/29/topic-block-based-wal-format.md

### no-high-key-fence [IN] OBSERVATION
No node stores a high-key (upper-bound fence key); searchers cannot detect mid-split state, which would be required for Lehman & Yao's concurrent search protocol
- Source: entries/2026/05/29/topic-lehman-yao-1981.md

### no-high-key-per-node [IN] OBSERVATION
Nodes store no upper-bound "high key," so a reader cannot detect that a concurrent split moved keys to a right sibling; this is the second missing prerequisite for Lehman-Yao's single-latch protocol
- Source: entries/2026/05/29/topic-sibling-pointer-maintenance.md

### no-leader-lease-in-codebase [IN] OBSERVATION
Neither the Raft nor TOB implementations include leader lease or ReadIndex optimizations; all read and write operations pay full consensus cost.
- Source: entries/2026/05/29/topic-linearizable-reads-via-tob.md

### no-leak-detection-mechanism [IN] OBSERVATION
The B-tree has no allocation bitmap, page-reachability check, or post-recovery consistency scan to detect leaked or orphaned pages; once a page is lost between the tree and the free list, it is silently unrecoverable.
- Source: entries/2026/05/29/topic-free-list-corruption-risks.md

### no-o-direct-usage [IN] OBSERVATION
No file in the ddia-implementations codebase uses `O_DIRECT`, `os.open()`, or any `os.O_` flags; all I/O goes through Python's buffered `open()` and the kernel page cache
- Source: entries/2026/05/29/topic-o-direct-and-dio.md

### no-operational-path-to-correctness [IN] OBSERVATION
There is no operational path to system correctness: the system has no stable regime at any scale (storage degrades monotonically, failures widen the correctness gap, no paradigm is both scalable and self-healing), AND even if a stable state existed, correctness could not be verified (the gap between specification and implementation widens under every failure mode and testing validates the wrong model).

### no-page-size-enforcement-in-serialize [IN] OBSERVATION
Neither _serialize_internal nor _serialize_leaf checks whether the serialized output fits within page_size; overflow prevention is the caller's responsibility via the max_keys limit in BTree._insert.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-_serialize_internal.md

### no-pytest-config-exists [IN] OBSERVATION
The repository contains no `pyproject.toml`, `pytest.ini`, `setup.cfg`, or root `conftest.py`; pytest runs with pure default settings
- Source: entries/2026/05/29/topic-pytest-configuration.md

### no-refcount-in-codebase [IN] OBSERVATION
No file in the repository implements reference counting on SSTable files, Version objects, or any storage-layer snapshot structure
- Source: entries/2026/05/29/topic-mvcc-version-lifetime.md

### no-safe-operating-mode-exists [IN] OBSERVATION
The system has no safe operating mode in any scenario: it is unsafe under concurrent access (both read and write paths lack synchronization across core components) AND unsafe during crash recovery (no storage engine has a fully safe recovery path), meaning correctness is unachievable whether the system is running normally under load or recovering from failure.

### no-schema-validation [IN] OBSERVATION
The batch pipeline performs no validation that adjacent stages produce/consume compatible record formats; mismatched tuple shapes fail at runtime with indexing errors
- Source: entries/2026/05/29/batch-word-count-pipeline.md

### no-shadow-paging-implementation-exists [IN] OBSERVATION
The codebase contains no copy-on-write or shadow paging implementation; all three storage engines use write-ahead logging exclusively for crash recovery
- Source: entries/2026/05/29/topic-wal-vs-shadow-paging.md

### no-sibling-sentinel-is-max-uint32 [IN] OBSERVATION
`NO_SIBLING = 0xFFFFFFFF` marks the end of the leaf sibling chain; a leaf with this value is the rightmost at its level
- Source: entries/2026/05/29/topic-blink-tree-paper.md

### no-storage-paradigm-is-both-scalable-and-self-healing [IN] OBSERVATION
No storage paradigm offers both scalability and structural resilience: hash indexes are memory-bound, LSM miss-probes are linear due to missing Bloom integration, and all paradigms degrade monotonically after crashes with no safe recovery path (crash-unsafe compaction, incomplete CRC, replay without batch atomicity).

### no-structured-topology-support [IN] OBSERVATION
The codebase has no ring, star, or mesh topology modes for gossip; peer selection is always uniform random from all active nodes.
- Source: entries/2026/05/29/topic-topology-propagation-bounds.md

### no-subsystem-has-viable-recovery-strategy [IN] OBSERVATION
No subsystem at any architectural tier has a viable recovery strategy: storage-layer recovery is paradoxically over-engineered in infrastructure and under-implemented in usage (WAL builds complete but unused sequence/checkpoint machinery, with no safe crash recovery path), while application-layer event sourcing lacks durable persistence and uses conflated ID spaces — recovery fails both where infrastructure exists but is unused and where it was never built.

### no-transactional-offset-commit [IN] OBSERVATION
The partitioned log provides no mechanism to atomically commit consumer offsets together with processing side effects, ruling out exactly-once delivery without external idempotency
- Source: entries/2026/05/29/topic-kafka-consumer-offset-semantics.md

### no-vector-clocks-anywhere [IN] OBSERVATION
The entire repository uses scalar Lamport clocks for ordering; no vector clock or version vector implementation exists, so true concurrency detection (distinguishing causal order from concurrent writes) is absent
- Source: entries/2026/05/29/topic-conflict-resolution-strategies.md

### none-mode-never-syncs-on-append [IN] OBSERVATION
In `none` sync mode, `_do_sync()` is a complete no-op for individual appends; data reaches stable storage only via WAL rotation or explicit close
- Source: entries/2026/05/29/topic-fsync-semantics-by-mode.md

### not-lehman-yao-blink-tree [IN] OBSERVATION
The B-tree is a classical single-threaded B⁺-tree, not a Lehman-Yao B-link tree — internal nodes lack right-link pointers, there are no high keys, and no latch coupling; leaf sibling pointers exist solely for range scan traversal.
- Source: entries/2026/05/29/topic-internal-node-format.md

### offset-tracking-per-partition [IN] OBSERVATION
Consumer tracks offsets as a `dict[tuple[str, int], int]` mapping `(topic, partition)` to offset, so commit granularity is per-partition, not per-message
- Source: entries/2026/05/29/topic-kafka-consumer-offset-semantics.md

### only-new-primary-advances-view-in-vc [IN] OBSERVATION
Non-primary nodes do not update `current_view` in `_handle_view_change`; they wait for the `NEW_VIEW` message to adopt the new view, leaving them in a liminal state that rejects normal-phase messages for both old and new views.
- Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-_handle_view_change.md

### operation-result-tracks-partition-cost [IN] OBSERVATION
Every `DocumentPartitionedDB` and `TermPartitionedDB` operation returns an `OperationResult` with a `partitions_touched` field, making read/write amplification cost observable and directly comparable between strategies.
- Source: entries/2026/05/29/secondary-index-partitioning-test_secondary_index_partitioning.md

### ordering-infrastructure-broken-at-every-layer [IN] OBSERVATION
Ordering and position infrastructure is broken or volatile at every layer: WAL sequence numbers are diligently computed and stored but never consulted during recovery, event sourcing conflates stream-scoped and global position IDs creating addressing ambiguity, and CDC consumer positions exist only in memory and are lost on restart — no ordering mechanism in the codebase reliably serves its intended purpose.

### ordering-models-are-incompatible-across-modules [IN] OBSERVATION
The codebase uses three incompatible time/ordering models with no unifying framework: Lamport clocks provide total order via node-ID tiebreaking, vector clocks provide partial order with explicit concurrency detection, and wall-clock timestamps provide no causal guarantees and are not even monotonic.

### orset-add-wins-semantics [IN] OBSERVATION
Concurrent add and remove of the same element in `ORSet` resolves in favor of the add, because the add's unique tag was not in the remover's tombstone snapshot
- Source: entries/2026/05/29/conflict-free-replicated-data-types-crdts.md

### orset-no-causal-context [IN] OBSERVATION
The ORSet implementation uses no version vector, dot context, or causal summary to compress tombstone state; each removed tag is stored individually as a `(replica_id, seq)` tuple in the tombstone set
- Source: entries/2026/05/29/topic-or-set-tombstone-growth.md

### orset-remove-returns-false-for-absent [IN] OBSERVATION
`ORSet.remove()` returns `False` for a non-existent element rather than raising an exception, treating removal of absent elements as a no-op with a boolean status indicator.
- Source: entries/2026/05/29/conflict-free-replicated-data-types-test_crdts.md

### orset-tombstone-growth-unbounded-and-necessary [IN] OBSERVATION
ORSet tombstones create an inherent resource leak: they must be retained indefinitely for merge correctness (removing them causes deleted elements to reappear when merging with replicas that still hold the original tags), yet they grow monotonically with no compaction or garbage collection, meaning memory consumption increases without bound over the set's lifetime proportional to total remove operations.

### orset-tombstone-required-for-merge-correctness [IN] OBSERVATION
Dropping ORSet tombstones would cause removed elements to reappear when merging with a replica that still holds the element's active tags — the tombstone set in `merge()` at line 165 suppresses stale tags via set difference
- Source: entries/2026/05/29/topic-or-set-tombstone-growth.md

### orset-tombstone-set-monotonic [IN] OBSERVATION
`ORSet._tombstones` is append-only: tags are added during `remove()` and unioned during `merge()` but never deleted, so the tombstone set grows without bound over the lifetime of the replica
- Source: entries/2026/05/29/topic-or-set-tombstone-growth.md

### orset-tombstones-grow-monotonically [IN] OBSERVATION
`ORSet._tombstones` is only ever unioned during `merge()` and added to during `remove()`; there is no compaction or garbage collection, so the tombstone set grows without bound
- Source: entries/2026/05/29/topic-delta-state-crdts.md

### overlapping-determines-merge-set [IN] OBSERVATION
The merge set for leveled compaction is always exactly one source SSTable plus all SSTables in the target level whose key range intersects it, constructed via `_overlapping()` at `sstable.py:428-429`.
- Source: entries/2026/05/29/topic-leveled-compaction-write-amplification.md

### overlapping-is-pure-query [IN] OBSERVATION
`_overlapping` performs no I/O, mutation, or side effects — it operates entirely on in-memory metadata (`min_key`, `max_key`) cached during `SSTableReader` construction.
- Source: entries/2026/05/29/sstable-and-compaction-sstable-_overlapping.md

### overlapping-uses-interval-overlap-test [IN] OBSERVATION
`_overlapping` determines SSTable key-range overlap using the standard interval overlap test (`a.min_key <= b.max_key AND a.max_key >= b.min_key`) with lexicographic string comparison matching the SSTable sort order.
- Source: entries/2026/05/29/sstable-and-compaction-sstable-_overlapping.md

### own-writes-always-visible [IN] OBSERVATION
A transaction always sees versions it created itself (`created_by == tx.tx_id`), regardless of commit status, ensuring read-your-own-writes consistency within a transaction.
- Source: entries/2026/05/29/snapshot-isolation-mvcc_database-_is_visible.md

### padding-leaves-use-empty-hash [IN] OBSERVATION
Non-power-of-2 leaf counts are padded with `EMPTY_HASH` (SHA-256 of empty bytes) to form a complete binary tree; these padding nodes appear in proof paths but do not represent real data.
- Source: entries/2026/05/29/topic-merkle-proof-security-model.md

### page-manager-4096-coincidence [IN] OBSERVATION
`PageManager` defaults to 4096-byte pages matching the OS page size and common disk sector size, but this is a coincidental match with `O_DIRECT` alignment requirements — the implementation uses buffered I/O through Python's `open()` and never enforces alignment
- Source: entries/2026/05/29/topic-o-direct-aligned-io.md

### page-manager-flush-not-durable [IN] OBSERVATION
PageManager calls `flush()` on every `write_page` and `_write_meta` call but only calls `os.fsync()` in `sync()` and `close()`, so individual page writes are not crash-durable without the WAL layer
- Source: entries/2026/05/29/b-tree-storage-engine-btree-PageManager.md

### page-manager-meta-reads-untracked [IN] OBSERVATION
`read_meta()` does not increment `pages_read`, so metadata access is invisible to the I/O statistics reported by `BTree.stats`
- Source: entries/2026/05/29/b-tree-storage-engine-btree-PageManager.md

### page-manager-metadata-durability-gap [IN] OBSERVATION
B-tree metadata has a durability gap at both lifecycle points: initial creation writes metadata directly without WAL protection, and all subsequent updates flush but never fsync, meaning metadata can be lost or inconsistent after any crash regardless of when it occurred.

### page-manager-no-page-size-on-disk [IN] OBSERVATION
The page size is not persisted in the data file; reopening with a different `page_size` value silently corrupts all page reads with no error or detection mechanism
- Source: entries/2026/05/29/b-tree-storage-engine-btree-PageManager.md

### page-padding-to-page-size [IN] OBSERVATION
`_wal_write_page` pads data to exactly `page_size` with null bytes before passing it to both `wal.log_write` and `pm.write_page`, ensuring the WAL record and data file page are byte-identical.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-_wal_write_page.md

### page-size-default-is-4096 [IN] OBSERVATION
`PageManager` defaults to 4096-byte pages, matching typical filesystem block size; this sets the hard upper bound for serialized leaf and internal node content.
- Source: entries/2026/05/29/topic-page-overflow-and-split-mechanics.md

### page-zero-is-metadata [IN] OBSERVATION
Page 0 is always the metadata page and must never be written directly — only through `_write_meta`; user data pages start at page 1, with the initial empty leaf root always at page 1
- Source: entries/2026/05/29/b-tree-storage-engine-btree-PageManager.md

### partial-hint-load-truncates-silently [IN] OBSERVATION
`_load_hint_file` in both Bitcask implementations treats a short read as end-of-file (breaks out of the read loop without error), silently dropping all keys that would have appeared after the truncation point
- Source: entries/2026/05/29/topic-bitcask-hint-durability.md

### participant-cannot-self-resolve [IN] OBSERVATION
`Participant.recover()` returns a list of in-doubt transaction IDs but cannot commit or abort them; only the coordinator holds the decision, so participants remain locked until `Coordinator.recover()` re-sends it.
- Source: entries/2026/05/29/topic-2pc-blocking-problem.md

### partition-failures-compound-across-membership-and-consensus [IN] OBSERVATION
Network partitions create cascading failures across the membership and consensus layers: gossip-based failure detection is the single correctness gate for replication, and Raft's partition hazard (stale leader silently accepting doomed writes, forced re-election on rejoin) depends on membership accuracy that gossip's timeout-based suspicion may misjudge.

### partitioned-hash-join-partitions-both-sides [IN] OBSERVATION
`PartitionedHashJoin.join()` calls `partition_dataset()` on both inputs before joining, meaning partitions are computed at join time rather than assumed from storage layout
- Source: entries/2026/05/29/topic-ddia-ch10-reduce-side-joins.md

### partitioned-log-compaction-preserves-keyless [IN] OBSERVATION
Log compaction retains only the last occurrence per key but always preserves messages with `key=None`.
- Source: entries/2026/05/29/partitioned-log-partitioned_log.md

### partitioned-log-key-routing-deterministic [IN] OBSERVATION
Messages with the same key always route to the same partition via `MD5(key) % num_partitions`, guaranteeing per-key ordering within a partition.
- Source: entries/2026/05/29/partitioned-log-partitioned_log.md

### partitioned-log-offset-monotonic [IN] OBSERVATION
Offsets within a partition are strictly increasing and never reused; `_base_offsets` only advances forward via `truncate` and `compact_partition`.
- Source: entries/2026/05/29/partitioned-log-partitioned_log.md

### partitioned-log-partition-count-limit [IN] OBSERVATION
`Topic.__init__` enforces partition count between 1 and 128, raising `ValueError` outside that range.
- Source: entries/2026/05/29/partitioned-log-partitioned_log.md

### partitioned-log-persistence-jsonl [IN] OBSERVATION
When `persist_dir` is set, messages are appended as JSONL files named `{topic}_{partition}.jsonl` and committed offsets are written to a single `offsets.json`.
- Source: entries/2026/05/29/partitioned-log-partitioned_log.md

### partitioned-log-rebalance-eager-round-robin [IN] OBSERVATION
`ConsumerGroup.rebalance()` is synchronous and triggered on every membership change, distributing partitions round-robin across sorted consumer IDs — no incremental or cooperative protocol.
- Source: entries/2026/05/29/partitioned-log-partitioned_log.md

### payload-only-crc-leaves-metadata-unprotected [IN] OBSERVATION
All three storage engine WAL implementations share a systematic integrity blind spot: CRC checksums cover only data payloads, leaving routing metadata (sequence numbers, page numbers) unprotected against silent corruption.

### pbft-cluster-validates-n-3f-plus-1 [IN] OBSERVATION
`PBFTCluster` raises `ValueError` if `n` is not exactly `3f+1` or if more than `f` byzantine nodes are specified, enforcing the minimum redundancy required for Byzantine fault tolerance.
- Source: entries/2026/05/29/byzantine-fault-tolerance-test_pbft.md

### pbft-commit-self-logged-before-broadcast [IN] OBSERVATION
`_check_prepared` logs the node's own COMMIT into `message_log` and `phase_senders` before returning it for broadcast, so duplicate detection correctly rejects re-delivery of the node's own COMMIT.
- Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-_check_prepared.md

### pbft-default-str-fragile [IN] OBSERVATION
The `default=str` parameter in the PBFT digest function silently converts non-serializable objects to their `str()` representation, which is not deterministic across replicas for types like `datetime` or custom objects — a latent consensus-breaking bug if non-primitive request payloads are introduced.
- Source: entries/2026/05/29/topic-json-canonicalization-risks.md

### pbft-duplicate-messages-silently-dropped [IN] OBSERVATION
`PBFTNode.receive_message` returns an empty list for duplicate or invalid messages (out-of-range sender IDs, already-seen tuples) rather than raising exceptions, consistent with Byzantine protocol design where you can't trust the sender.
- Source: entries/2026/05/29/byzantine-fault-tolerance-test_pbft.md

### pbft-executed-log-sequence-numbers-contiguous [IN] OBSERVATION
Honest nodes execute requests with contiguous sequence numbers starting from 1, enforced by the protocol's ordering guarantees and verified by test assertions on the `_executed_log`.
- Source: entries/2026/05/29/byzantine-fault-tolerance-test_pbft.md

### pbft-prepared-at-most-once [IN] OBSERVATION
A `(view, seq)` pair can transition to "prepared" at most once per node, enforced by the `prepared_requests` set guard at the top of `_check_prepared`.
- Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-_check_prepared.md

### pbft-prepared-chains-to-committed [IN] OBSERVATION
`_check_prepared` always calls `_check_committed` after marking prepared, enabling immediate commit if enough COMMIT messages arrived out of order before the node reached the prepared state.
- Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-_check_prepared.md

### pbft-prepared-threshold-is-2f [IN] OBSERVATION
`_check_prepared` requires exactly 2f matching PREPAREs (not 2f+1) because the primary's PRE-PREPARE counts as implicit agreement toward the quorum of 2f+1.
- Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-_check_prepared.md

### pbft-primary-skips-prepare [IN] OBSERVATION
The primary sends PRE-PREPARE but not PREPARE; `_handle_prepare` rejects PREPAREs from the primary, and the primary's code path in `_handle_pre_prepare` skips sending a PREPARE — this asymmetry is why the prepared threshold is 2f not 2f+1.
- Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-_check_prepared.md

### pbft-sole-sort-keys-user [IN] OBSERVATION
`byzantine-fault-tolerance/pbft.py:46` is the only call site in the codebase that uses `json.dumps(sort_keys=True)` for hashing; all other `json.dumps` calls are for storage serialization where canonical form is irrelevant.
- Source: entries/2026/05/29/topic-json-canonicalization-risks.md

### pbft-submit-request-is-synchronous-simulation [IN] OBSERVATION
`PBFTCluster.submit_request` runs the full three-phase protocol to completion deterministically in a single call (no real networking or async), returning a boolean success indicator.
- Source: entries/2026/05/29/byzantine-fault-tolerance-test_pbft.md

### pbft-view-change-preserves-prepared-requests [IN] OBSERVATION
Requests that reach the prepared state survive view changes: the new primary collects prepared-but-uncommitted requests from VIEW_CHANGE messages and re-proposes them with fresh PRE_PREPARE messages in the new view.
- Source: entries/2026/05/29/byzantine-fault-tolerance-test_pbft.md

### pbft-view-change-preserves-safety-and-liveness [IN] OBSERVATION
PBFT view changes maintain both safety (requiring 2f+1 messages before acting) and liveness (carrying prepared-but-uncommitted requests into the new view), matching the theoretical protocol guarantees.

### pending-queue-drives-ring-propagation [IN] OBSERVATION
Every accepted remote change (including synthesized merge results) is appended to `_pending` for downstream propagation, enabling multi-hop replication in ring topologies where a single sync round cannot reach all nodes.
- Source: entries/2026/05/29/multi-leader-replication-multi_leader-apply_remote_change.md

### pending-requeue-enables-multihop [IN] OBSERVATION
Every accepted remote change in `apply_remote_change` is appended to `_pending`, causing it to be forwarded to the next node in the replication topology during the next sync cycle
- Source: entries/2026/05/29/topic-ring-topology-propagation.md

### per-record-crc-cannot-detect-reordering [IN] OBSERVATION
Independent per-record CRCs in `wal.py`, `btree.py`, and `bitcask.py` validate each record in isolation, so record deletion, reordering, or splicing produces a file where every CRC passes but the log is semantically corrupt
- Source: entries/2026/05/29/topic-cumulative-checksums.md

### persist-after-memory [IN] OBSERVATION
The event is appended to `self._events` (in-memory) before `_persist_event` writes it to disk; a disk write failure leaves the in-memory store ahead of the on-disk log with no rollback.
- Source: entries/2026/05/29/event-sourcing-store-event_store-_persist_event.md

### persist-before-notify [IN] OBSERVATION
Disk persistence via `_persist_event` happens before subscriber notification: an event is written to the NDJSON file before any `LiveProjection` or other subscriber sees it.
- Source: entries/2026/05/29/event-sourcing-store-event_store-_persist_event.md

### persist-event-no-fsync [IN] OBSERVATION
`_persist_event` writes to the OS buffer via `file.write()` but never calls `fsync`; event data can be lost on crash between write and OS flush to disk.
- Source: entries/2026/05/29/event-sourcing-store-event_store-_persist_event.md

### persist-event-per-call-open [IN] OBSERVATION
The persist file is opened in append mode and closed on every single `_persist_event` call, not held open across the store's lifetime — safe but incurs per-event open/close overhead.
- Source: entries/2026/05/29/event-sourcing-store-event_store-_persist_event.md

### pipeline-lazy-pull-model [IN] OBSERVATION
Pipeline execution is pull-based via nested Python generators; records flow only when the terminal consumer iterates, and stages interleave execution without threads
- Source: entries/2026/05/29/batch-word-count-pipeline.md

### pipeline-no-inter-stage-schema-validation [IN] OBSERVATION
Pipeline stages pass untyped tuples with no validation that the output shape of one stage matches the input expectations of the next; a mismatched stage composition fails at runtime, not at construction time
- Source: entries/2026/05/29/topic-ddia-chapter-10-batch-processing.md

### pipeline-records-are-tuples [IN] OBSERVATION
Pipeline stages produce and consume lists of `(key, value)` tuples, where key is a string and value depends on the stage (string for Tokenize, int for Count).
- Source: entries/2026/05/29/batch-word-count-test_pipeline.md

### pipeline-run-returns-list [IN] OBSERVATION
`Pipeline.run()` materializes all output records into a `list`, while `run_lazy()` returns an iterator that yields records on demand without full materialization.
- Source: entries/2026/05/29/batch-word-count-test_pipeline.md

### pipeline-stages-hide-materialization-barriers [IN] OBSERVATION
Pipeline stages have fundamentally different resource profiles hidden behind a uniform tuple interface: Count is an unbounded memory barrier that materializes all input before emitting any output, Sort bounds memory via external merge-sort disk spillover, yet both produce and consume identical tuple lists with no way for the pipeline to distinguish streaming stages from blocking ones.

### pncounter-composed-of-two-gcounters [IN] OBSERVATION
PNCounter state is structured as two GCounter sub-counters (`p` for increments, `n` for decrements) with `value() = p - n`, confirming the standard decomposition from CRDT literature.
- Source: entries/2026/05/29/conflict-free-replicated-data-types-test_crdts.md

### pncounter-composes-gcounters [IN] OBSERVATION
`PNCounter` delegates entirely to two `GCounter` instances (`p` for increments, `n` for decrements); it contains no independent merge or counting logic
- Source: entries/2026/05/29/conflict-free-replicated-data-types-crdts.md

### point-lookup-always-reaches-leaf [IN] OBSERVATION
Every `get` must descend to a leaf node regardless of tree height since internal nodes hold no values; cost is exactly *h* page reads for a tree of height *h*
- Source: entries/2026/05/29/topic-b-plus-tree-leaf-chains.md

### post-crash-compaction-produces-irrecoverable-corruption [IN] OBSERVATION
A crash during compaction produces irrecoverable data loss: no implementation uses write-temp/fsync/rename for atomicity (Bitcask deletes before renaming, LSM lacks atomic rename), and the resulting file corruption is terminal because every reader halts at the first CRC failure with no resync or skip capability.

### postgres-defers-btree-rebalancing-to-vacuum [IN] OBSERVATION
PostgreSQL's `nbtree` never merges or redistributes siblings on delete; it marks tuples dead and relies on VACUUM to reclaim space, trading immediate space efficiency for throughput under concurrency
- Source: entries/2026/05/28/topic-postgres-nbtree-lazy-deletion.md

### predicate-exceptions-silently-skipped [IN] OBSERVATION
If the predicate function throws on a `(key, value)` pair in `read_predicate`, that pair is silently excluded from results via bare `except Exception: pass` — no logging, no error propagation
- Source: entries/2026/05/29/write-skew-detection-ssi_database-read_predicate.md

### primary-key-reads-unaffected-by-async [IN] OBSERVATION
`get(pk)` reads directly from `partitions[pid].documents` and is never affected by `async_index` mode — staleness only applies to `query_by_field` which reads from `global_index`
- Source: entries/2026/05/29/topic-async-index-consistency.md

### producer-keyed-hash-unkeyed-roundrobin [IN] OBSERVATION
Partitioned log `Producer` hashes keyed messages to a fixed partition for ordering guarantees, and round-robins unkeyed messages across partitions for load distribution
- Source: entries/2026/05/29/topic-ddia-chapter-6-partitioning.md

### projection-catch-up-advances-position-for-unhandled-events [IN] OBSERVATION
`catch_up` advances `_position` for every event, not just those with registered handlers, so unhandled event types don't create replay gaps
- Source: entries/2026/05/29/event-sourcing-store-event_store-catch_up.md

### projection-catch-up-assumes-sequential-ids [IN] OBSERVATION
`catch_up` uses `from_position=self._position + 1` arithmetic that assumes `event_id` values are contiguous integers with no gaps; IDs with gaps would cause events to be silently skipped
- Source: entries/2026/05/29/event-sourcing-store-event_store-catch_up.md

### projection-catch-up-poison-event-stalls [IN] OBSERVATION
If a handler raises during `catch_up`, `_position` is not advanced past the failing event, causing all subsequent `catch_up` calls to re-encounter and re-fail on the same event indefinitely
- Source: entries/2026/05/29/event-sourcing-store-event_store-catch_up.md

### projection-is-stateless-replay [IN] OBSERVATION
A `Projection`'s `_state` is a cache of derived data, not a source of truth — it can be fully reconstructed from the event log at any time by replaying from position 0
- Source: entries/2026/05/29/topic-cqrs-read-models.md

### projection-snapshot-interval-counts-all-events [IN] OBSERVATION
The snapshot interval counter increments for every event processed during `catch_up`, including those without handlers, so snapshot frequency is based on total event throughput not just handled events
- Source: entries/2026/05/29/event-sourcing-store-event_store-catch_up.md

### protocol-safety-unfalsifiable-under-current-testing [IN] OBSERVATION
Distributed protocol safety claims are unfalsifiable under the current testing methodology: specific protocol gaps (2PC's blocking window with unused timeout, Raft's stale-leader writes) AND general protocol behavior are validated exclusively under synchronous simulation with deterministic delivery — safety violations that manifest only under asynchronous or partitioned delivery are structurally invisible to the test suite.

### protocol-safety-validated-only-under-synchronous-model [IN] OBSERVATION
Distributed protocol safety properties are validated exclusively under synchronous simulation (deterministic tick-based delivery, no real network I/O), but the most critical failure mode — network partitions creating stale-leader write acceptance and forced re-elections — is inherently asynchronous, creating an untested gap between modeled and real-world safety.

### put-append-only [IN] OBSERVATION
`put` never modifies existing on-disk data; overwrites create a new record appended to the active segment and update only the in-memory index, leaving old records as reclaimable garbage until compaction
- Source: entries/2026/05/29/log-structured-hash-table-bitcask-put.md

### put-no-tombstone-guard [IN] OBSERVATION
`put` does not reject the `TOMBSTONE` sentinel as a value; passing it creates a record indistinguishable from a delete, which silently corrupts the key's state on recovery
- Source: entries/2026/05/29/log-structured-hash-table-bitcask-put.md

### put-return-value-ignored-by-coordinator [IN] OBSERVATION
`ReadRepairStore` never checks the boolean return from `Replica.put()` during read repair or writes, relying on version monotonicity to make stale or redundant writes harmless no-ops.
- Source: entries/2026/05/29/read-repair-read_repair.md

### pytest-files-have-broader-coverage [IN] OBSERVATION
The `test_*.py` files are consistently longer than their `tester_test_*.py` counterparts (e.g., 207 vs 125 lines for bloom-filter), containing additional edge case tests beyond the tester's core invariant checks
- Source: entries/2026/05/29/topic-tester-vs-test-convention.md

### pytest-must-run-per-directory [IN] OBSERVATION
Tests must be executed from within each module's directory (e.g., `cd bloom-filter && pytest`) because imports use bare module names with no package prefix or installed package
- Source: entries/2026/05/29/topic-conftest-and-pytest-config.md

### python-bitcask-no-keydir-sharing [IN] OBSERVATION
Each `BitcaskStore` instance independently rebuilds the full in-memory index from disk; there is no shared-memory mechanism to preserve or share the keydir across instances even within the same process, unlike Erlang's ETS-backed keydir
- Source: entries/2026/05/29/topic-erlang-ets-concurrency-model.md

### python-bitcask-no-refcount-or-locking [IN] OBSERVATION
Neither Python Bitcask implementation has reference counting, reader registration, or file-handle locking — compaction can delete segment files while a concurrent reader holds a stale index entry, unlike Erlang Bitcask which defers deletion until the last reader's refcount drops to zero.
- Source: entries/2026/05/29/topic-reference-counted-file-handles.md

### quorum-guarantees-weakened-at-two-levels [IN] OBSERVATION
The distributed storage layer systematically weakens quorum semantics: sloppy quorums count hint storage as successful writes, and sub-quorum configurations (R+W<=N) produce only warnings rather than errors, making it possible to operate without intersection guarantees.

### quorum-violation-warns-not-raises [IN] OBSERVATION
When `R + W <= N`, the `ReadRepairStore` constructor emits a `warnings.warn` rather than raising an exception, allowing intentionally relaxed quorum configurations.
- Source: entries/2026/05/29/read-repair-read_repair.md

### raft-advance-commit-on-tick-and-response [IN] OBSERVATION
`_advance_commit_index` is invoked both on heartbeat ticks and immediately after successful append responses update `_match_index`, allowing eager commit advancement without waiting for the next heartbeat cycle.
- Source: entries/2026/05/29/raft-consensus-raft-_advance_commit_index.md

### raft-cluster-deterministic-simulation [IN] OBSERVATION
`RaftCluster` uses tick-based deterministic simulation with no real clocks, threads, or network I/O; time advances only via explicit `tick(ms)` calls, making partition and election scenarios fully reproducible.
- Source: entries/2026/05/29/raft-consensus-test_raft.md

### raft-commit-index-monotonic [IN] OBSERVATION
`_commit_index` only ever increases within `_advance_commit_index` because the loop starts at `_commit_index + 1` and only assigns forward; it can never decrease or be reset.
- Source: entries/2026/05/29/raft-consensus-raft-_advance_commit_index.md

### raft-commit-requires-strict-majority [IN] OBSERVATION
An entry is committed when `count > total // 2`, meaning for a 5-node cluster at least 3 replicas (including the leader) must have the entry, matching Raft's quorum requirement.
- Source: entries/2026/05/29/raft-consensus-raft-_advance_commit_index.md

### raft-current-term-commit-only [IN] OBSERVATION
`_advance_commit_index` only commits entries whose term matches `_current_term`, implementing the Raft safety property from §5.4.2 that prevents committing prior-term entries by replica count alone.
- Source: entries/2026/05/29/raft-consensus-raft.md

### raft-follower-rejects-writes [IN] OBSERVATION
`RaftNode.client_request()` on a follower returns `{"success": False, "error": "not leader"}` as a structured dict rather than raising an exception; only the leader accepts client writes.
- Source: entries/2026/05/29/raft-consensus-test_raft.md

### raft-is-multi-paxos-optimization [IN] OBSERVATION
The Raft implementation in `raft-consensus/raft.py` embodies the Multi-Paxos leader-lease pattern: one election per term via `_become_leader`, then Phase-2-only replication (AppendEntries) for all entries within that term — heartbeats act as the lease
- Source: entries/2026/05/29/topic-multi-paxos-vs-single-decree.md

### raft-log-direct-indexed [IN] OBSERVATION
Log entries are accessed by log index as a direct Python list index (`self._log[prev_log_index]`), which assumes all entries from index 0 onward are present — log truncation would require an offset or a different data structure
- Source: entries/2026/05/29/topic-raft-log-compaction.md

### raft-log-never-shrinks [IN] OBSERVATION
`self._log` in `RaftNode` is append-only: no method ever removes entries, so the log grows monotonically with client requests and memory usage is unbounded
- Source: entries/2026/05/29/topic-raft-log-compaction.md

### raft-match-index-defaults-zero [IN] OBSERVATION
Peers missing from `_match_index` are treated as having replicated nothing via `dict.get(peer, 0)`; `_become_leader` initializes all peers to 0, so this default is safe but allows graceful handling of unexpected state.
- Source: entries/2026/05/29/raft-consensus-raft-_advance_commit_index.md

### raft-minority-cannot-elect [IN] OBSERVATION
A partition containing fewer than a strict majority of nodes cannot elect a leader; `get_leader()` returns `None` for a 2-of-5 minority partition.
- Source: entries/2026/05/29/raft-consensus-test_raft.md

### raft-no-leader-forwarding [IN] OBSERVATION
`client_request` returns `{"success": False, "error": "not leader"}` if the node is not the leader; there is no automatic forwarding to the current leader.
- Source: entries/2026/05/29/raft-consensus-raft.md

### raft-no-prevote-or-lease [IN] OBSERVATION
The Raft implementation has no PreVote, leader lease, or CheckQuorum mechanism; split-brain safety relies entirely on term numbers and the single-vote-per-term invariant
- Source: entries/2026/05/29/topic-raft-leader-stickiness.md

### raft-no-state-machine-apply [IN] OBSERVATION
`_last_applied` is tracked but never used to actually apply commands to a state machine, making snapshotting impossible without first implementing application logic
- Source: entries/2026/05/29/topic-raft-log-compaction.md

### raft-partition-creates-dual-hazard [IN] OBSERVATION
Network partitions create a compound safety hazard in Raft: the isolated leader silently accepts writes that can never commit, and when it rejoins its inflated term forces a disruptive re-election of the healthy partition's leader.

### raft-partition-via-set [IN] OBSERVATION
Network partitions are simulated by adding node IDs to `_partitioned`; partitioned nodes neither tick nor send/receive messages, modeling complete network isolation.
- Source: entries/2026/05/29/raft-consensus-raft.md

### raft-randomized-timeout-only-split-vote-defense [IN] OBSERVATION
The only mechanism preventing simultaneous candidacy is the randomized election timeout range (default 150–300ms); there is no protocol-level tiebreaker such as PreVote
- Source: entries/2026/05/29/topic-raft-leader-stickiness.md

### raft-rejoin-forces-election [IN] OBSERVATION
A partitioned node that rejoins after repeated failed elections carries an inflated term number, forcing the healthy leader to step down via `_become_follower` and triggering a completely unnecessary cluster-wide election
- Source: entries/2026/05/29/topic-raft-leader-stickiness.md

### raft-sentinel-entry-assumption [IN] OBSERVATION
The Raft log is initialized with a sentinel `LogEntry(term=0, index=0, command=None)` at position 0, and multiple methods depend on this entry always being present at `self._log[0]`
- Source: entries/2026/05/29/topic-raft-log-compaction.md

### raft-sentinel-log-entry [IN] OBSERVATION
The log is initialized with a sentinel `LogEntry(term=0, index=0)` at position 0 that is never removed, eliminating empty-log edge cases in `_last_log_index()` and `_last_log_term()`.
- Source: entries/2026/05/29/raft-consensus-raft.md

### raft-single-tick-delivery [IN] OBSERVATION
`RaftCluster.tick()` collects outbound messages and delivers both the request and its response within the same tick invocation — there is no simulated network latency.
- Source: entries/2026/05/29/raft-consensus-raft.md

### raft-stale-leader-accepts-writes [IN] OBSERVATION
A partitioned leader continues accepting client requests (returning `success: True`) even though those entries can never commit and will be overwritten when the partition heals — the gap a leader lease would close
- Source: entries/2026/05/29/topic-raft-leader-stickiness.md

### raft-terms-strictly-increase [IN] OBSERVATION
Each new leader election produces a strictly higher term than the previous leader; terms never repeat or decrease across leader transitions, as validated by the `test_terms_monotonically_increasing` test.
- Source: entries/2026/05/29/raft-consensus-test_raft.md

### raft-uncommitted-entries-overwritten [IN] OBSERVATION
Uncommitted log entries from a deposed leader are overwritten when the node receives `AppendEntries` from a new leader with conflicting entries at the same index; only committed entries are durable.
- Source: entries/2026/05/29/raft-consensus-test_raft.md

### range-partition-merge-guards-threshold [IN] OBSERVATION
`merge_small_partitions()` only merges adjacent partitions when their combined size stays below `min_partition_size`, preventing immediate re-splitting after merge.
- Source: entries/2026/05/29/range-partitioning-test_range_partitioning.md

### range-partition-merge-is-greedy-left-to-right [IN] OBSERVATION
`merge_small_partitions` sweeps left-to-right and greedily merges adjacent pairs whose combined size is at or below `min_partition_size`, which can produce different results than an optimal merge strategy
- Source: entries/2026/05/29/topic-hash-partitioning-skew.md

### range-partition-no-exceptions-api [IN] OBSERVATION
`RangePartitionedStore` uses return-value signaling: `get()` returns `None` for missing keys and `delete()` returns `False` for absent keys, rather than raising exceptions.
- Source: entries/2026/05/29/range-partitioning-test_range_partitioning.md

### range-partition-sequential-insert-skew [IN] OBSERVATION
Range partitioning has its own skew risk: keys arriving in sorted order all hit the rightmost partition; `Partition.split()` mitigates this for data already stored by splitting at the median, but cannot prevent temporary hotspots during sequential ingestion
- Source: entries/2026/05/29/topic-hash-partitioning-skew.md

### range-partition-split-at-median [IN] OBSERVATION
`Partition.split` always divides at the median key (index `len // 2`), which can produce uneven partitions when key distribution is skewed
- Source: entries/2026/05/29/topic-ddia-chapter-6-partitioning.md

### range-partitioning-auto-split-manual-merge [IN] OBSERVATION
Splits are triggered automatically during `put()` when a partition exceeds `max_partition_size`, but merges require an explicit call to `merge_small_partitions()` and never happen implicitly.
- Source: entries/2026/05/29/range-partitioning-range_partitioning.md

### range-partitioning-boundary-routing-bisect [IN] OBSERVATION
Key routing uses `bisect_right(boundaries, key) - 1` on a parallel sorted list of partition start keys, giving O(log p) partition lookup where p is the partition count.
- Source: entries/2026/05/29/range-partitioning-range_partitioning.md

### range-partitioning-contiguous-half-open-coverage [IN] OBSERVATION
Partitions use half-open intervals `[start_key, end_key)` with the first partition starting at `""` and the last having `end_key=None`, guaranteeing the entire string keyspace is covered with no gaps or overlaps.
- Source: entries/2026/05/29/range-partitioning-range_partitioning.md

### range-partitioning-merge-assumes-adjacency [IN] OBSERVATION
`Partition.merge(other)` appends the other partition's keys without re-sorting; correctness depends on the caller passing the immediate right neighbor, with no runtime adjacency validation.
- Source: entries/2026/05/29/range-partitioning-range_partitioning.md

### range-partitioning-no-thread-safety [IN] OBSERVATION
No concurrency protection exists on any mutation path (`put`, `delete`, `split`, `merge`); concurrent access would corrupt the sorted key/value lists.
- Source: entries/2026/05/29/range-partitioning-range_partitioning.md

### range-partitioning-parallel-arrays-routing [IN] OBSERVATION
`_partitions` and `_boundaries` are kept in lockstep as parallel arrays — index `i` in both refers to the same partition — trading O(n) insert/delete for O(1) indexed access and O(log n) binary search routing.
- Source: entries/2026/05/29/range-partitioning-range_partitioning.md

### range-partitioning-routing-is-structurally-fragile [IN] OBSERVATION
Range partitioning routing depends on maintaining strict lockstep invariants across parallel data structures that are never verified: boundaries and partitions are kept as parallel arrays where any desynchronization silently misroutes keys, merge operations assume partition adjacency without checking, and splits always divide at the median regardless of key distribution skew.

### range-partitioning-split-at-median-index [IN] OBSERVATION
`Partition.split()` divides at `len(keys)//2` (count-based median), not the lexicographic midpoint of the key range, producing equal-count halves but potentially unequal key ranges.
- Source: entries/2026/05/29/range-partitioning-range_partitioning.md

### range-scan-end-key-exclusive [IN] OBSERVATION
`range_scan(start_key, end_key)` treats `end_key` as an exclusive upper bound — `range_scan("k05", "k10")` returns keys `k05` through `k09`
- Source: entries/2026/05/29/topic-range-scan-implementation.md

### range-scan-follows-sibling-chain [IN] OBSERVATION
`range_scan` walks the leaf-level linked list via `next_sibling` pointers rather than re-descending the tree, giving O(height + leaf pages in range) page reads.
- Source: entries/2026/05/28/b-tree-storage-engine-btree-range_scan.md

### range-scan-half-open-interval [IN] OBSERVATION
`BTree.range_scan(start_key, end_key)` returns keys in the half-open interval `[start_key, end_key)` — start-inclusive, end-exclusive; `end_key=None` scans to the last key.
- Source: entries/2026/05/28/b-tree-storage-engine-btree-range_scan.md

### range-scan-lacks-safety-guarantees [IN] OBSERVATION
Range scans are vulnerable to three independent failure modes: concurrent modifications produce inconsistent results (no snapshot isolation), corrupted sibling pointers can cause infinite traversal (no cycle guard), and large result sets consume unbounded memory (eager materialization into a list).

### range-scan-last-writer-wins [IN] OBSERVATION
`range_scan` iterates sources oldest-to-newest (SSTables, then immutable memtables, then active memtable) so that dict overwrites naturally give each key the value from the newest source
- Source: entries/2026/05/28/log-structured-merge-tree-lsm-range_scan.md

### range-scan-materializes-results [IN] OBSERVATION
`range_scan` collects all matching `(key, value)` pairs into a list in memory rather than yielding lazily; large ranges consume proportional memory.
- Source: entries/2026/05/28/b-tree-storage-engine-btree-range_scan.md

### range-scan-merges-memtable-and-sstables [IN] OBSERVATION
`range_scan()` combines results from both the active memtable and all SSTables in `self._sstables`, making it sensitive to mutations of either data structure during iteration
- Source: entries/2026/05/28/topic-lsm-concurrency-safety.md

### range-scan-newer-wins-by-priority [IN] OBSERVATION
The range scan merge uses an incrementing priority counter where higher priority = newer source; the dict naturally deduplicates by overwriting older entries for the same key
- Source: entries/2026/05/28/topic-merge-iterator-vs-dict-merge.md

### range-scan-no-concurrency-protection [IN] OBSERVATION
`range_scan` can observe inconsistent state if a flush or compaction runs concurrently, since there is no snapshot isolation or locking
- Source: entries/2026/05/28/log-structured-merge-tree-lsm-range_scan.md

### range-scan-no-cycle-guard [IN] OBSERVATION
`range_scan` has no visited-set or cycle detection on the leaf sibling chain; a corrupted `next_sibling` pointer would cause an infinite loop.
- Source: entries/2026/05/28/b-tree-storage-engine-btree-range_scan.md

### range-scan-no-mutation [IN] OBSERVATION
`Partition.range_scan` is a pure read operation with no side effects — it never modifies `_keys`, `_values`, or any other state, and returns a fresh list the caller can mutate freely.
- Source: entries/2026/05/29/range-partitioning-range_partitioning-range_scan.md

### range-scan-none-end-scans-unbounded [IN] OBSERVATION
Passing `end=None` to `Partition.range_scan` sets the right boundary to `len(self._keys)`, scanning from `start` through the last key with no upper bound.
- Source: entries/2026/05/29/range-partitioning-range_partitioning-range_scan.md

### range-scan-priority-field-unused [IN] OBSERVATION
The priority integer stored in the `merged` dict during `range_scan` is never read or compared; correctness relies entirely on the source iteration order and dict overwrite semantics, making the priority field dead weight
- Source: entries/2026/05/28/log-structured-merge-tree-lsm-range_scan.md

### range-scan-pure-read [IN] OBSERVATION
`range_scan` has no side effects: it does not mutate tree state, trigger flushes, or modify the WAL
- Source: entries/2026/05/28/log-structured-merge-tree-lsm-range_scan.md

### range-scan-relies-on-sorted-keys [IN] OBSERVATION
`range_scan` correctness depends on `_keys` being sorted, an invariant maintained by `put`/`delete`/`split`/`merge` but not enforced structurally — direct mutation of `_keys` bypassing these methods silently produces wrong results.
- Source: entries/2026/05/29/range-partitioning-range_partitioning-range_scan.md

### range-scan-supports-unbounded [IN] OBSERVATION
Calling `range_scan(start_key)` without an `end_key` scans from `start_key` through the last key in the rightmost leaf, terminating at the `NO_SIBLING` sentinel
- Source: entries/2026/05/29/topic-range-scan-implementation.md

### reaches-uses-identity-not-equality [IN] OBSERVATION
`_reaches` uses `is` (object identity) to find the source event, not `==`, because dataclass equality would conflate distinct events with identical field values.
- Source: entries/2026/05/29/lamport-clocks-lamport-_reaches.md

### reaches-walks-two-edge-types [IN] OBSERVATION
The causal graph has two edge types: `_parent` (same-node sequential) and `_cause` (cross-node send→receive), and `_reaches` must follow both to correctly determine happens-before.
- Source: entries/2026/05/29/lamport-clocks-lamport-_reaches.md

### read-path-anticipates-concurrent-flush [IN] OBSERVATION
Both `get()` (line 254) and `range_scan()` (line 286) check `_immutable_memtables` in the correct order, indicating the read path was designed for a concurrent-flush architecture that the write path never implements
- Source: entries/2026/05/29/topic-immutable-memtable-gap.md

### read-path-unreliable-from-storage-through-derived-systems [IN] OBSERVATION
The complete read path from storage through derived systems is unreliable at every stage: the SSTable layer has compounding integrity and performance deficiencies (no per-entry checksums, linear miss-probes from missing Bloom integration, redundant key-list rebuilds per lookup) while the CDC backbone feeding all derived systems depends on reconstruction heuristics for event semantics and requires callers to manually flush and provide old values for consistency.

### read-predicate-deps-only-for-committed [IN] OBSERVATION
`read_predicate` records dependency graph edges only for keys in the committed snapshot (`k in snap`), not for keys the transaction itself created via `write()` — avoiding self-dependencies
- Source: entries/2026/05/29/write-skew-detection-ssi_database-read_predicate.md

### read-predicate-stores-result-snapshot [IN] OBSERVATION
`read_predicate` appends a `(predicate, dict(result))` tuple to `tx._predicate_locks` — a deep copy so `commit()` can compare against the original result even after database state changes
- Source: entries/2026/05/29/write-skew-detection-ssi_database-read_predicate.md

### read-repair-resurrects-deleted-keys [IN] OBSERVATION
`DynamoCluster.get()` read repair pushes the highest-versioned value to stale replicas, so a naive delete (removing from `_store`) is undone when a recovering node still holds the old value
- Source: entries/2026/05/29/topic-leaderless-deletion-gap.md

### read-repair-scope-is-quorum-only [IN] OBSERVATION
Read repair in `get()` only fixes replicas that were part of the read quorum, not all replicas; full-cluster repair requires a separate call to `anti_entropy_repair()`.
- Source: entries/2026/05/29/read-repair-read_repair.md

### read-uses-linear-scan-for-offset-gaps [IN] OBSERVATION
`Topic.read` finds the start position by scanning for `msg.offset >= offset` rather than computing an array index, which is necessary because compaction creates offset holes that break arithmetic indexing.
- Source: entries/2026/05/29/topic-log-compaction-vs-retention.md

### readlines-accepts-list-or-filepath [IN] OBSERVATION
`ReadLines` accepts either a `list[str]` of in-memory lines or a file path string as its input source, providing two input modes from the same stage.
- Source: entries/2026/05/29/batch-word-count-test_pipeline.md

### reads-and-writes-same-consensus-path [IN] OBSERVATION
`LinearizableRegister` routes both reads and writes through the same TOB slot mechanism, giving them positions in the same total order — no read-only fast path exists.
- Source: entries/2026/05/29/topic-linearizable-reads-via-tob.md

### receive-gossip-death-is-sticky [IN] OBSERVATION
Once a node's local status is `dead`, `receive_gossip` will never change it back to `alive`, even if the remote has a higher heartbeat counter — preventing flapping after failure declaration
- Source: entries/2026/05/29/gossip-protocol-gossip_protocol-receive_gossip.md

### receive-gossip-deep-copies-on-insert [IN] OBSERVATION
`receive_gossip` deep-copies remote entries before inserting them into local membership, extending the isolation guarantee beyond the `send_gossip`/`join`/`get_membership_list` paths covered by `gossip-deep-copy-isolation`
- Source: entries/2026/05/29/gossip-protocol-gossip_protocol-receive_gossip.md

### receive-gossip-no-self-exclusion [IN] OBSERVATION
Unlike `detect_failures` which skips the node's own ID, `receive_gossip` has no self-exclusion guard — if the remote gossip contains an entry for the receiver, it is merged normally
- Source: entries/2026/05/29/gossip-protocol-gossip_protocol-receive_gossip.md

### receive-replica-is-passive-endpoint [IN] OBSERVATION
`_receive_replica` in `vector-clocks/vector_clock.py` is the receiving side of anti-entropy replication; nothing in the codebase calls it automatically — tests simulate the call manually, and no gossip or sync module invokes it
- Source: entries/2026/05/29/topic-dynamo-anti-entropy.md

### reconcile-replaces-all-siblings [IN] OBSERVATION
`reconcile` unconditionally replaces all siblings for a key with a single merged version, regardless of whether the provided contexts cover every existing sibling.
- Source: entries/2026/05/29/vector-clocks-vector_clock.md

### reconstruct-state-is-pure-read [IN] OBSERVATION
`reconstruct_state` never modifies the `EventStore`; it is a read-only fold over a single stream's events that returns a locally-constructed dict
- Source: entries/2026/05/29/event-sourcing-store-event_store-reconstruct_state.md

### reconstruct-state-no-snapshot-support [IN] OBSERVATION
`reconstruct_state` always replays from the beginning of the stream via `read_stream` with default `from_version=0`; it does not leverage snapshots, unlike `Projection`
- Source: entries/2026/05/29/event-sourcing-store-event_store-reconstruct_state.md

### reconstruct-state-up-to-is-global-id [IN] OBSERVATION
The `up_to` parameter in `reconstruct_state` filters on global `event_id`, not stream-local version number, despite operating on a single stream
- Source: entries/2026/05/29/event-sourcing-store-event_store-reconstruct_state.md

### record-format-is-length-prefixed [IN] OBSERVATION
Each WAL record starts with a 4-byte little-endian length prefix (not counting itself), followed by CRC, seq_num, op_type, and length-prefixed key/value pairs — minimum 25 bytes per record for empty key/value
- Source: entries/2026/05/28/topic-wal-checksum-format.md

### record-length-implicitly-validated [IN] OBSERVATION
Corruption of `record_length` causes `_read_record` to consume wrong bytes for the payload, which fails the CRC check indirectly — making explicit CRC coverage of the framing field unnecessary
- Source: entries/2026/05/28/topic-wal-crc-coverage-gap.md

### recover-seq-no-gap-detection [IN] OBSERVATION
`_recover_seq_num` finds the max sequence number but does not detect or report gaps in the sequence space, meaning lost records between rotations or corruption are silently ignored.
- Source: entries/2026/05/29/write-ahead-log-wal-_recover_seq_num.md

### recover-seq-num-is-global-max [IN] OBSERVATION
`_recover_seq_num` returns the maximum sequence number across all WAL files, not just the latest file, ensuring correctness even when records are spread across rotated segments.
- Source: entries/2026/05/29/write-ahead-log-wal-_recover_seq_num.md

### recover-seq-skips-corrupt-files [IN] OBSERVATION
`_recover_seq_num` treats CRC errors as file-local (abandons the corrupted file, continues scanning subsequent files), unlike `_read_all_records` which stops all scanning on corruption.
- Source: entries/2026/05/29/write-ahead-log-wal-_recover_seq_num.md

### recovery-destroys-both-durability-and-isolation [IN] OBSERVATION
A single restart simultaneously destroys both durability and isolation guarantees along independent axes: the WAL's carefully calibrated write-time durability tiers (per-write fsync vs batch-only fsync) become invisible to recovery, and the transaction isolation model's composed invariant layers (MVCC visibility plus SSI conflict detection) break because abort metadata and monotonic counters are non-persistent.

### recovery-requires-participant-availability [IN] OBSERVATION
`Coordinator.recover()` checks `is_available()` before re-sending decisions and skips unavailable participants, meaning a double failure (coordinator crash + participant crash) leaves locks held until both are up and recovery re-runs.
- Source: entries/2026/05/29/topic-2pc-blocking-problem.md

### recovery-scans-all-segments [IN] OBSERVATION
All three storage implementations (WAL, hash-index Bitcask, log-structured Bitcask) scan every segment file on startup to rebuild state, making recovery time proportional to segment count rather than data volume.
- Source: entries/2026/05/29/topic-wal-segment-sizing-tradeoffs.md

### recovery-simultaneously-over-and-under-engineered [IN] OBSERVATION
Recovery is paradoxically both over-engineered and under-implemented: the WAL builds complete sequence number and checkpoint infrastructure that's never consulted during replay, while the actual recovery paths lack batch atomicity and metadata CRC protection — engineering effort was invested in infrastructure that goes unused while the active recovery path remains unsafe.

### reference-implementations-lack-tombstone-ttl [IN] OBSERVATION
None of the DDIA reference implementations combine tombstone support with time-bounded garbage collection; the multi-leader module stores tombstones indefinitely and the Dynamo module has no delete support at all.
- Source: entries/2026/05/29/topic-tombstone-gc-and-repair-window.md

### rename-without-dir-barrier [IN] OBSERVATION
Both Bitcask compaction paths (`hash-index-storage/bitcask.py:297`, `log-structured-merge-tree/bitcask.py:301`) perform `os.rename()` without a subsequent directory fsync, making the rename non-durable on ext4 and XFS
- Source: entries/2026/05/28/topic-directory-fsync-semantics.md

### renew-preserves-token-number [IN] OBSERVATION
`LockService.renew()` extends the lock TTL by mutating `issued_at` and `ttl` without changing the `FencingToken.token` value, while re-`acquire()` by the same client issues a new higher token
- Source: entries/2026/05/29/fencing-tokens-fencing_tokens.md

### replay-does-not-enforce-batch-atomicity [IN] OBSERVATION
Despite the docstring claiming replay "skips uncommitted batches," the implementation returns all PUT/DELETE records regardless of whether a matching COMMIT record exists — a crash mid-batch will replay partial batch records
- Source: entries/2026/05/28/write-ahead-log-wal-replay.md

### replay-flush-before-read [IN] OBSERVATION
`replay` flushes the write file descriptor under `self._lock` before reading, ensuring all prior `append` calls are visible on disk during the read phase
- Source: entries/2026/05/28/write-ahead-log-wal-replay.md

### replay-lacks-batch-atomicity-across-implementations [IN] OBSERVATION
Both the unbundled WAL and B-tree WAL replay all CRC-valid records without verifying batch completeness, meaning partial batches from mid-write crashes are silently applied as if they were complete transactions.

### replay-not-lock-protected-during-read [IN] OBSERVATION
The file read phase of `replay` runs without holding `self._lock`, so concurrent `append` calls during replay could produce a partial read of in-flight writes
- Source: entries/2026/05/28/write-ahead-log-wal-replay.md

### replay-returns-eager-list [IN] OBSERVATION
`replay` returns an in-memory `List[WALRecord]` snapshot, not a lazy iterator — the full result set is materialized before returning, which can be large if the WAL is large
- Source: entries/2026/05/28/write-ahead-log-wal-replay.md

### replay-strict-greater-than-filter [IN] OBSERVATION
`replay(after_seq=n)` returns records with `seq_num` strictly greater than `n` (exclusive lower bound), not greater-or-equal — passing 0 replays the entire log
- Source: entries/2026/05/28/write-ahead-log-wal-replay.md

### replica-get-does-not-check-availability [IN] OBSERVATION
`Replica.get` does not consult `self._available`; filtering unavailable replicas is the caller's responsibility via `ReadRepairStore._available_replicas()`.
- Source: entries/2026/05/29/read-repair-read_repair-get.md

### replica-get-has-no-side-effects [IN] OBSERVATION
`Replica.get` performs no mutations, availability checks, or I/O — it is a pure dictionary lookup, unlike `ReadRepairStore.get` which triggers read repair on stale replicas.
- Source: entries/2026/05/29/read-repair-read_repair-get.md

### replica-get-returns-tuple-or-none [IN] OBSERVATION
`Replica.get` returns a `(value, version)` tuple when the key exists, or `None` when it doesn't — callers must guard against `None` before destructuring, as there is no sentinel tuple.
- Source: entries/2026/05/29/read-repair-read_repair-get.md

### replica-put-no-concurrency-safety [IN] OBSERVATION
The version check and store update in `Replica.put` are not atomic (TOCTOU race on lines 22–25); the method is unsafe under concurrent access without external locking.
- Source: entries/2026/05/29/read-repair-read_repair-put.md

### replica-put-rejects-older-versions [IN] OBSERVATION
`Replica.put()` silently rejects writes with a version strictly less than the stored version (returns `False`), enforcing monotonic version advancement; equal versions are accepted as last-writer-wins.
- Source: entries/2026/05/29/read-repair-read_repair.md

### replica-versions-caller-coordinated [IN] OBSERVATION
`Replica.put` does not generate version numbers; version assignment is the responsibility of `ReadRepairStore.put`, which reads the max version across all replicas (including unavailable ones) before writing.
- Source: entries/2026/05/29/read-repair-read_repair-put.md

### replica-versions-start-at-one [IN] OBSERVATION
Versions assigned by `ReadRepairStore.put` start at 1 (computed as `max_version + 1` with initial max of 0), so a version of 0 in `ReadRepairStore.get` return values means "key not found," not "initial version."
- Source: entries/2026/05/29/read-repair-read_repair-get.md

### reprepare-merges-by-sequence [IN] OBSERVATION
When the new primary merges prepared sets from VIEW_CHANGE messages, it deduplicates by sequence number using first-writer-wins without verifying that all nodes prepared the same request at that sequence — relying on the PBFT safety proof that at most one request can be prepared per (view, seq).
- Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-_handle_view_change.md

### resolve-record-dead-code [IN] OBSERVATION
`writer_field_map` is constructed but never referenced in `_resolve_record`, making it dead code likely left from a refactor
- Source: entries/2026/05/29/avro-serializer-avro_serializer-_resolve_record.md

### resolve-record-shared-defaults [IN] OBSERVATION
Default values in `_resolve_record` are inserted by reference without deep copying, so mutable defaults (lists, dicts) would be shared across all decoded records that use the default
- Source: entries/2026/05/29/avro-serializer-avro_serializer-_resolve_record.md

### resolve-record-skip-unknown [IN] OBSERVATION
Writer fields not present in the reader schema are fully consumed from the buffer via `_skip()` and discarded, ensuring the byte stream stays correctly positioned for subsequent fields
- Source: entries/2026/05/29/avro-serializer-avro_serializer-_resolve_record.md

### resolve-record-two-pass [IN] OBSERVATION
`_resolve_record` uses a two-pass algorithm: pass 1 consumes all writer fields from the buffer in wire order (decoding or skipping each), pass 2 assembles the result dict in reader field order from decoded values and defaults
- Source: entries/2026/05/29/avro-serializer-avro_serializer-_resolve_record.md

### resource-token-tracking-is-independent [IN] OBSERVATION
Each resource in `FencedResourceServer` maintains its own `_highest_token` entry, so a high token on resource A does not cause rejection of a lower token on resource B.
- Source: entries/2026/05/29/topic-redlock-controversy.md

### ring-requeues-via-apply-remote-change [IN] OBSERVATION
`apply_remote_change` appends accepted changes to the receiving node's `_pending` list, enabling store-and-forward propagation in ring topology; without this re-enqueue, changes would stop at the first hop.
- Source: entries/2026/05/29/topic-ring-vs-all-to-all-propagation-delay.md

### ring-requires-n-minus-one-rounds [IN] OBSERVATION
`RING` topology advances changes by exactly one hop per `sync()` round due to snapshot-isolated pending queue draining, requiring N-1 rounds for a single-source change to reach all N nodes.
- Source: entries/2026/05/29/topic-ring-vs-all-to-all-propagation-delay.md

### ring-topology-convergence-is-linear-in-node-count [IN] OBSERVATION
Ring topology requires O(N) sync rounds for full convergence because each round advances changes by exactly one hop via store-and-forward requeuing, unlike fully-connected topologies that converge in one round.

### ring-topology-needs-multiple-sync-rounds [IN] OBSERVATION
A 3-node ring topology requires at least 2 `sync()` rounds to achieve full convergence, unlike a fully-connected topology which converges in 1 round.
- Source: entries/2026/05/29/multi-leader-replication-test_multi_leader.md

### rotation-always-fsyncs [IN] OBSERVATION
The WAL `_rotate()` method unconditionally calls `flush()` and `os.fsync()` before closing the current segment, providing a durability checkpoint even in `none` mode
- Source: entries/2026/05/29/topic-fsync-semantics-by-mode.md

### rotation-preserves-fd-invariant [IN] OBSERVATION
After `_maybe_rotate` returns, `self._fd` is guaranteed non-None and open for writing (assuming it was non-None on entry), because `_rotate()` always opens a new file.
- Source: entries/2026/05/29/write-ahead-log-wal-_maybe_rotate.md

### saturation-boundary-untested [IN] OBSERVATION
No test in `test_bloom_filter.py` exercises the counter saturation boundary (16+ collisions at a single position) to verify deletion correctness under overflow
- Source: entries/2026/05/29/topic-counter-overflow-and-cuckoo-filters.md

### save-snapshot-deep-copies-state [IN] OBSERVATION
`save_snapshot` uses `copy.deepcopy` to isolate the stored snapshot from subsequent mutations to `_state`
- Source: entries/2026/05/29/topic-snapshot-storage-format.md

### scalable-bloom-tightens-fpr [IN] OBSERVATION
Each new slice in `ScalableBloomFilter` uses FPR of `p * (ratio ^ slice_index)`, so the aggregate false positive rate stays bounded even as the filter grows unboundedly
- Source: entries/2026/05/28/topic-bloom-filter-integration.md

### scan-range-corrupt-data-silent [IN] OBSERVATION
If `_read_entry` encounters truncated data mid-range, `_scan_range_for_key` silently returns `(False, None)` rather than raising an error, making corruption indistinguishable from a missing key
- Source: entries/2026/05/29/log-structured-merge-tree-lsm-_scan_range_for_key.md

### scan-range-sorted-order-assumption [IN] OBSERVATION
`_scan_range_for_key` assumes entries between `start` and `end` are sorted by key ascending; it breaks early on `k > key`, so unsorted data causes silent false negatives
- Source: entries/2026/05/29/log-structured-merge-tree-lsm-_scan_range_for_key.md

### scan-segment-ordering-invariant [IN] OBSERVATION
`_scan_segment` must be called on segments in ascending segment-ID order for the index to correctly reflect last-write-wins semantics; no runtime check enforces this ordering — it is the caller's responsibility.
- Source: entries/2026/05/28/log-structured-hash-table-bitcask-_scan_segment.md

### schema-registry-in-memory-only [IN] OBSERVATION
`SchemaRegistry` is a pure in-memory dict with monotonic ID assignment, with no persistence, HTTP API, or subject-based grouping
- Source: entries/2026/05/29/topic-confluent-schema-registry-protocol.md

### schema-registry-no-compat-enforcement [IN] OBSERVATION
Schema registration accepts any schema without checking compatibility against previously registered versions under the same subject, unlike Confluent which enforces backward/forward/full compatibility on registration
- Source: entries/2026/05/29/topic-confluent-schema-registry-protocol.md

### secondary-index-document-class-unused [IN] OBSERVATION
The `Document` dataclass exists as a domain concept but the DB classes accept `pk` and `fields` separately and never consume `Document` instances.
- Source: entries/2026/05/29/secondary-index-partitioning-secondary_index_partitioning.md

### secondary-index-no-error-on-missing [IN] OBSERVATION
`get()` on a nonexistent key returns `OperationResult(data=None)` and `query_by_field()` for absent values returns empty results rather than raising exceptions; documents missing indexed fields are stored but not indexed.
- Source: entries/2026/05/29/secondary-index-partitioning-test_secondary_index_partitioning.md

### secondary-index-stale-entry-cleanup [IN] OBSERVATION
Both `DocumentPartitionedDB` and `TermPartitionedDB` remove stale index entries before writing new ones during `put` and `delete`, preventing phantom results from old field values.
- Source: entries/2026/05/29/secondary-index-partitioning-secondary_index_partitioning.md

### seen-set-keyed-by-ts-origin [IN] OBSERVATION
`_seen` maps each key to a set of `(timestamp, origin_node_id)` tuples; a change is skipped if its exact `(ts, origin)` pair is already in the set for that key
- Source: entries/2026/05/29/topic-ring-topology-propagation.md

### segment-rotation-limits-corruption-blast-radius [IN] OBSERVATION
Both Bitcask implementations rotate to new segment files at a configurable size threshold, bounding the maximum data loss from stop-at-first-error to the records trailing the corruption within a single segment
- Source: entries/2026/05/29/topic-record-boundary-recovery.md

### seq-num-corruption-is-silent [IN] OBSERVATION
A bit flip in a WAL record's `seq_num` passes CRC validation and is silently accepted with the wrong sequence number, potentially corrupting replay filtering, truncation boundaries, and recovery sequence numbering
- Source: entries/2026/05/28/topic-wal-crc-coverage-gap.md

### seq-num-crc-fix-is-zero-cost [IN] OBSERVATION
Adding `seq_num` to the CRC input requires changing two lines (`_encode_record` and `_read_record`) with no measurable runtime overhead, since CRC32 over 9 extra bytes is negligible
- Source: entries/2026/05/29/topic-seq-num-integrity-gap.md

### serialization-has-no-overflow-check [IN] OBSERVATION
`_serialize_leaf` packs all entries into a buffer without comparing its length to `page_size`; overflow prevention is entirely the caller's responsibility
- Source: entries/2026/05/29/topic-page-overflow-and-size-limits.md

### serialize-leaf-is-pure [IN] OBSERVATION
`_serialize_leaf` is a pure function with no side effects; all I/O happens in the caller via `_wal_write_page`
- Source: entries/2026/05/28/b-tree-storage-engine-btree-_serialize_leaf.md

### sibling-direction-matters [IN] OBSERVATION
Proof siblings are tagged "left" or "right" because SHA-256 concatenation is order-dependent — `H(A||B) != H(B||A)` — so swapping sibling position produces a different parent hash and verification fails.
- Source: entries/2026/05/29/topic-merkle-proof-security-model.md

### sibling-field-is-structural-only [IN] OBSERVATION
Leaf pages carry a `next_sibling` field in their serialized format (`NO_SIBLING = 0xFFFFFFFF` sentinel), but no tree operation uses sibling pointers for traversal or merging — it is scaffolding, not a live data structure. Note: may refine existing `btree-leaf-sibling-chain`.
- Source: entries/2026/05/28/topic-concurrent-btree-deletion.md

### sibling-pointers-leaf-only [IN] OBSERVATION
Sibling pointers exist only on leaf nodes (`_serialize_leaf` at line 182); internal nodes have no right-links, making this a B+-tree, not a B-link tree
- Source: entries/2026/05/29/topic-sibling-pointer-maintenance.md

### sibling-preservation-on-concurrent-writes [IN] OBSERVATION
`VersionedKVStore.put` never drops an existing version unless the new version's clock dominates it; concurrent versions are preserved as siblings.
- Source: entries/2026/05/29/vector-clocks-vector_clock.md

### silent-corruption-returns-garbage [IN] OBSERVATION
A bit-flip in a value region of `hash-index-storage/bitcask.py` causes `get()` to return corrupted data with no error; the only partial integrity check is the key-equality assertion
- Source: entries/2026/05/29/topic-hash-index-bitcask-no-crc.md

### silent-data-loss-is-default-operational-mode [IN] OBSERVATION
Silent data loss is the default operational mode: corruption propagates through every data transformation pipeline without detection (compaction copies without CRC validation, hint generation skips integrity checks), and the testing methodology cannot observe durability failures (no crash tests, no fsync verification), meaning data degrades continuously with no feedback signal to operators.

### silent-mode-drops-all-outbound [IN] OBSERVATION
`SILENT` Byzantine mode returns an empty list for all outbound messages, but the node still processes incoming messages and updates internal state — simulating a crash fault rather than a protocol violation
- Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-_apply_byzantine.md

### single-file-design-enables-atomic-sync [IN] OBSERVATION
`PageManager` stores metadata (page 0) and data pages in a single file, so one `os.fsync()` call in `sync()` covers both; splitting into separate files would require multiple fsyncs with ordering constraints in the commit fence
- Source: entries/2026/05/29/topic-wal-commit-vs-metadata-sync-ordering.md

### single-kv-size-validated [IN] OBSERVATION
`BTree.put` raises `ValueError` for any single key-value pair that cannot fit in a page, blocking the case where even one entry would overflow
- Source: entries/2026/05/29/topic-page-overflow-and-size-limits.md

### single-threaded-by-design [IN] OBSERVATION
`BTree` has zero synchronization primitives; the sibling chain, split logic, and page allocation are correct only because no concurrent reader or writer can observe intermediate states
- Source: entries/2026/05/29/topic-sibling-pointer-maintenance.md

### single-threaded-masks-immutable-gap [IN] OBSERVATION
In the synchronous single-threaded LSM execution model, the missing immutable memtable lifecycle step causes no observable data loss because `_flush()` completes atomically within the caller's execution flow
- Source: entries/2026/05/29/topic-immutable-memtable-gap.md

### size-tiered-triggers-on-sstable-count [IN] OBSERVATION
Size-tiered compaction triggers when the total SSTable count meets the `min_threshold` parameter
- Source: entries/2026/05/29/sstable-and-compaction-tester_test_sstable.md

### snapshot-excludes-uncommitted-writes [IN] OBSERVATION
`_snapshot` only includes committed versions (`commit_ts <= snapshot_ts`); all three call sites manually overlay `tx._writes` and `tx._deletes` when they need the transaction-local view
- Source: entries/2026/05/29/write-skew-detection-ssi_database-_snapshot.md

### snapshot-keyed-by-projection-name [IN] OBSERVATION
Each projection's snapshot is stored under `self.name` in `_store._snapshots`, so two projections with the same name on the same store will overwrite each other
- Source: entries/2026/05/29/topic-snapshot-storage-format.md

### snapshot-loss-triggers-full-replay [IN] OBSERVATION
When event sourcing snapshots are missing, the system falls back to replaying all events from the log to reconstruct aggregate state — correct but O(n) in total event count
- Source: entries/2026/05/29/topic-snapshot-persistence.md

### snapshot-returns-detached-copy [IN] OBSERVATION
`SSIDatabase._snapshot` returns a new dict each call; callers in `read_predicate` and `commit` mutate it freely (overlaying writes, removing deletes) without corrupting the MVCC store
- Source: entries/2026/05/29/write-skew-detection-ssi_database-_snapshot.md

### snapshot-state-excludes-subscription [IN] OBSERVATION
Snapshot data (`_store._snapshots`) contains only `state` and `position`; subscriber callbacks registered in `EventStore._subscribers` are not captured, so restoring a snapshot does not restore live-update behavior
- Source: entries/2026/05/29/topic-projection-snapshot-interaction.md

### snapshot-uses-next-timestamp-for-current-state [IN] OBSERVATION
During commit validation, `_snapshot(self._next_timestamp)` captures all committed versions because `_next_timestamp` is strictly greater than any existing `commit_ts`
- Source: entries/2026/05/29/write-skew-detection-ssi_database-_snapshot.md

### snapshots-are-in-memory-only [IN] OBSERVATION
Projection snapshots are stored in `store._snapshots` (an in-memory dict), not persisted to disk, so they are lost on process restart
- Source: entries/2026/05/29/event-sourcing-store-event_store.md

### snapshots-dict-is-monkey-patched [IN] OBSERVATION
The `_snapshots` attribute is not declared on `EventStore`; it is lazily created by `Projection.save_snapshot` via `hasattr`/setattr on the store instance
- Source: entries/2026/05/29/topic-snapshot-storage-format.md

### sort-external-merge [IN] OBSERVATION
`Sort` stage spills sorted chunks to temp JSONL files when the in-memory buffer exceeds `memory_limit`, then uses `heapq.merge` with a `KeyedRecord` wrapper (seq tiebreaker) for stable k-way merge
- Source: entries/2026/05/29/batch-word-count-pipeline.md

### sort-merge-join-detects-presorted-input [IN] OBSERVATION
`SortMergeJoin` inspects whether input is already sorted and reports this via `stats["sorted_input"]`, allowing it to skip the sort step on pre-sorted data
- Source: entries/2026/05/29/map-side-join-test_map_side_joins.md

### sparse-index-limits-corruption-blast-radius [IN] OBSERVATION
A corrupted data entry in an SSTable only affects lookups within one sparse index segment (between two adjacent index offsets, default 16 entries); keys indexed by other segments remain correctly accessible
- Source: entries/2026/05/29/topic-sstable-footer-integrity.md

### split-brain-delivers-two-hops [IN] OBSERVATION
Election messages from demoted leaders are delivered one level deep (election message + immediate response), not recursively to convergence; the next `tick()` finishes any remaining propagation.
- Source: entries/2026/05/29/leader-election-leader_election-_resolve_split_brain.md

### split-brain-ignores-unavailable-nodes [IN] OBSERVATION
Only nodes where `is_available()` is `True` are considered when detecting and resolving split-brain; crashed nodes' stale leader state is ignored.
- Source: entries/2026/05/29/leader-election-leader_election-_resolve_split_brain.md

### split-brain-not-recorded-in-history [IN] OBSERVATION
Elections triggered by `_resolve_split_brain` do not call `_record_leader`, so leadership changes from split-brain resolution are absent from `election_history`.
- Source: entries/2026/05/29/leader-election-leader_election-_resolve_split_brain.md

### split-brain-runs-after-message-drain [IN] OBSERVATION
`_resolve_split_brain` executes only after the tick's `while all_messages` delivery loop has fully drained, never mid-delivery.
- Source: entries/2026/05/29/leader-election-leader_election-_resolve_split_brain.md

### split-chain-rewiring-order [IN] OBSERVATION
During a leaf split, the new right page is written with the old leaf's `next_sibling` before the old leaf is rewritten to point to the new page; WAL ordering ensures the pointer target is durable before anything references it
- Source: entries/2026/05/29/topic-sibling-pointer-maintenance.md

### ssi-commit-returns-dict-not-exception [IN] OBSERVATION
`SSIDatabase.commit()` returns a structured dict `{"committed": bool, "reason": str|None, "conflicts": list}` rather than raising, letting callers inspect what conflicted for retry logic
- Source: entries/2026/05/29/write-skew-detection-ssi_database.md

### ssi-extends-mvcc-with-dependency-tracking [IN] OBSERVATION
`SSIDatabase` adds a `_dependency_graph` and `_predicate_locks` on top of MVCC to detect read-write conflicts (write skew) that basic snapshot isolation permits
- Source: entries/2026/05/29/topic-mvcc-snapshot-isolation.md

### ssi-own-writes-require-buffer-check [IN] OBSERVATION
`SSIDatabase.read` must check `tx._writes` and `tx._deletes` before consulting `_store`, since a transaction's own writes aren't materialized into the shared store until commit — unlike MVCCDatabase where eager writes make own-writes visible automatically
- Source: entries/2026/05/29/topic-snapshot-vs-ssi-tradeoffs.md

### ssi-pessimistic-mode-aborts-at-read [IN] OBSERVATION
When `_pessimistic` is true, `_check_read_conflict` raises `RuntimeError` and sets `tx._status = "aborted"` at read time rather than waiting for commit — eager detection vs optimistic validation
- Source: entries/2026/05/29/write-skew-detection-ssi_database.md

### ssi-phantom-detection-uses-predicate-reevaluation [IN] OBSERVATION
Phantom detection re-evaluates each stored predicate function against current committed state at commit time, comparing both key sets and values to the snapshot-time result
- Source: entries/2026/05/29/write-skew-detection-ssi_database.md

### ssi-read-only-skip-validation [IN] OBSERVATION
Read-only transactions (empty write set) bypass all conflict detection in `commit()` and always commit successfully — they cannot cause write skew because they don't write
- Source: entries/2026/05/29/write-skew-detection-ssi_database.md

### ssi-serializability-through-layered-invariants [IN] OBSERVATION
SSI's commit path relies on three mutually reinforcing properties: read-only transactions skip validation (safe because the store contains only committed data from prior commits), and within a single transaction, writes and deletes for the same key are mutually exclusive, preventing conflicting modifications to that key within the transaction's own write set.

### ssi-store-contains-only-committed-data [IN] OBSERVATION
`SSIDatabase._store` only contains versions stamped with commit timestamps; uncommitted writes live in `tx._writes` until commit, so every version in the store is guaranteed to be from a committed transaction
- Source: entries/2026/05/29/topic-snapshot-vs-ssi-tradeoffs.md

### ssi-visibility-is-timestamp-comparison [IN] OBSERVATION
`SSIDatabase._visible_value` determines visibility by a single comparison (`commit_ts <= snapshot_ts`) with no active-set tracking, because the deferred-write model guarantees all stored versions are committed
- Source: entries/2026/05/29/topic-snapshot-vs-ssi-tradeoffs.md

### ssi-writes-deletes-mutually-exclusive [IN] OBSERVATION
`write()` discards any pending `delete()` for the same key and vice versa; a key is in `tx._writes` XOR `tx._deletes`, never both simultaneously
- Source: entries/2026/05/29/write-skew-detection-ssi_database.md

### sstable-block-size-is-index-interval [IN] OBSERVATION
`SSTableWriter`'s `block_size` parameter controls the sparse index frequency (one index entry per N records), not a physical byte-aligned block size; records are written contiguously with no padding or alignment to fixed byte boundaries.
- Source: entries/2026/05/29/topic-block-based-wal-format.md

### sstable-compaction-manager-copies-list [IN] OBSERVATION
`CompactionManager.get_sstables()` returns `list(self._sstables)` (a shallow copy), preventing iterator invalidation from concurrent mutation but not preventing file-deletion races on the underlying SSTable files
- Source: entries/2026/05/29/topic-superversion-refcount-implementation.md

### sstable-compaction-manager-lacks-bottommost-detection [IN] OBSERVATION
The `CompactionManager` tracks `_max_levels` (default 7) but has no logic to check whether lower levels contain overlapping key ranges before deciding tombstone removal, unlike LevelDB/RocksDB which use `VersionSet` metadata for this
- Source: entries/2026/05/29/topic-leveled-compaction-tombstone-policy.md

### sstable-crash-leaves-zero-count-header [IN] OBSERVATION
A crash between SSTableWriter creation and `finish()` leaves `entry_count=0` in the file header while data records exist on disk; `SSTableReader` trusts the header count and reads the file as empty, silently losing all written entries.
- Source: entries/2026/05/29/topic-crash-recovery-invariants.md

### sstable-entry-types-binary [IN] OBSERVATION
The SSTable format supports exactly two entry types (value and tombstone via `TOMBSTONE_MARKER`); there is no third type for merge operands, which would be required for RocksDB-style merge support
- Source: entries/2026/05/29/topic-rocksdb-merge-operator.md

### sstable-format-lacks-integrity-and-efficiency-mechanisms [IN] OBSERVATION
The SSTable format has no integrity or efficiency safeguards: entries carry no CRC or checksum, range scans bypass the sparse index and scan from file start, and the entry count header is trusted without verification — meaning corruption goes undetected, negative lookups are unnecessarily expensive, and truncated files produce silent data loss.

### sstable-get-miss-returns-none [IN] OBSERVATION
`SSTableReader.get()` returns `None` for keys not found in the table, rather than raising an exception
- Source: entries/2026/05/29/sstable-and-compaction-tester_test_sstable.md

### sstable-get-rebuilds-key-list-every-call [IN] OBSERVATION
`SSTable.get` extracts sparse index keys into a fresh list on every call rather than caching them, making each lookup O(m) in index size before the binary search begins
- Source: entries/2026/05/29/log-structured-merge-tree-lsm-get.md

### sstable-get-requires-prior-load-index [IN] OBSERVATION
For SSTables loaded from disk, `load_index()` must be called before `get()` or every lookup silently returns `(False, None)` — the method does not call `load_index()` itself
- Source: entries/2026/05/29/log-structured-merge-tree-lsm-get.md

### sstable-has-magic-and-version [IN] OBSERVATION
The SSTable format (`sstable.py:10-11`) uses a 4-byte magic `b"SSTB"` and a 2-byte version field, making it the only binary format in the repo with format versioning
- Source: entries/2026/05/29/topic-wal-record-format-evolution.md

### sstable-has-timestamps-lsm-ignores-them [IN] OBSERVATION
`SSTableEntry` in `sstable.py` carries a `timestamp` field, but `lsm.py`'s merge and conflict resolution uses only structural SSTable ordering (sequence number priority), not per-entry timestamps
- Source: entries/2026/05/29/topic-leveldb-merging-iterator.md

### sstable-layer-compounds-integrity-and-performance-deficiencies [IN] OBSERVATION
The SSTable layer has mutually reinforcing integrity and performance deficiencies: no per-entry CRC or checksum means corrupted entries return wrong values silently, and no Bloom filter integration forces every negative lookup to scan all SSTables sequentially, maximizing both the probability and frequency of encountering those silent corruptions.

### sstable-lookup-is-two-phase [IN] OBSERVATION
Point lookups in both SSTable implementations first binary-search the sparse index to identify a block, then linearly scan entries within that block's byte range; total cost is O(log(N/B) + B)
- Source: entries/2026/05/29/topic-sstable-block-format.md

### sstable-magic-is-only-integrity-check [IN] OBSERVATION
The 4-byte magic `b"SSTB"` at offset 0 is the only data integrity check in the `sstable-and-compaction` SSTable read path; it gates file-type validation but provides no corruption detection for keys, values, timestamps, or structural offsets
- Source: entries/2026/05/29/topic-sstable-magic-number-vs-crc.md

### sstable-merge-accepts-reader-objects [IN] OBSERVATION
`merge_sstables()` takes `SSTableReader` instances directly as inputs, using a streaming iterator interface rather than loading entire files into memory
- Source: entries/2026/05/29/sstable-and-compaction-tester_test_sstable.md

### sstable-merge-dedup-newest-timestamp [IN] OBSERVATION
When multiple SSTables contain the same key, `merge_sstables` keeps only the entry with the highest timestamp via heap ordering `(key, -timestamp)`, silently discarding all older versions.
- Source: entries/2026/05/28/sstable-and-compaction-sstable-merge_sstables.md

### sstable-merge-forward-only [IN] OBSERVATION
The k-way merge in `sstable-and-compaction/sstable.py` is forward-only with no direction tracking or reverse iteration support; there is no direction enum, no `FindLargest`, and no heap rebuild on direction change.
- Source: entries/2026/05/29/topic-leveldb-merger.md

### sstable-metadata-has-level-field [IN] OBSERVATION
`SSTableMetadata` in `sstable-and-compaction/sstable.py` tracks a `level` field, indicating the compaction module is structured for leveled compaction even though `lsm.py` uses flat single-level merges
- Source: entries/2026/05/29/topic-mvcc-version-snapshots.md

### sstable-metadata-level-field-unused [IN] OBSERVATION
`SSTableMetadata.level` in `sstable-and-compaction/sstable.py` exists as a field but has no persistent backing; level assignments would be lost on restart without a MANIFEST equivalent
- Source: entries/2026/05/28/topic-manifest-based-compaction.md

### sstable-no-checksums [IN] OBSERVATION
The SSTable format has no CRC or checksum on entries or blocks; the only validation is a magic-byte assertion (`MAGIC == b"SSTB"`) on file open.
- Source: entries/2026/05/28/sstable-and-compaction-sstable.md

### sstable-range-scan-end-exclusive [IN] OBSERVATION
`range_scan(start, end)` is inclusive on `start` and exclusive on `end`.
- Source: entries/2026/05/28/sstable-and-compaction-test_sstable.md

### sstable-range-scan-no-index-skip [IN] OBSERVATION
`SSTableReader.range_scan(start, end)` scans from the beginning of the file rather than using the sparse index to skip ahead, making it O(N) regardless of range position.
- Source: entries/2026/05/28/sstable-and-compaction-sstable.md

### sstable-scan-is-streaming [IN] OBSERVATION
`SSTable.scan()` and `scan_all()` use Python generators (yield), meaning individual SSTables already support streaming — only the cross-SSTable merge layer materializes
- Source: entries/2026/05/28/topic-merge-iterator-vs-dict-merge.md

### sstable-short-read-guards-prevent-overrun [IN] OBSERVATION
`_read_entry()` in the LSM SSTable checks for short reads on key and value fields, preventing buffer overruns on truncated files but silently skipping all remaining entries in the scan range
- Source: entries/2026/05/29/topic-sstable-footer-integrity.md

### sstable-sorted-order-caller-responsibility [IN] OBSERVATION
`SSTableWriter.add()` does not enforce sorted key order; violation silently corrupts binary search in `SSTableReader.get()`.
- Source: entries/2026/05/28/sstable-and-compaction-sstable.md

### sstable-sparse-index-bounds-corruption [IN] OBSERVATION
A corrupted SSTable data entry only affects lookups within one sparse index segment; the sparse index provides independent entry points into different file regions, bounding corruption blast radius to ~`block_size` entries
- Source: entries/2026/05/29/topic-length-prefix-framing-resilience.md

### sstable-sparse-index-every-nth [IN] OBSERVATION
`SSTableReader` loads a sparse index into memory that records every `block_size`-th key and byte offset; point lookups binary-search the index then linearly scan within the block.
- Source: entries/2026/05/28/sstable-and-compaction-sstable.md

### sstable-sparse-index-in-footer [IN] OBSERVATION
Both SSTable implementations store the sparse index at the end of the file with a footer pointer, meaning bloom filters could be co-located in the same footer region without changing the file layout
- Source: entries/2026/05/29/topic-bloom-filter-optimization.md

### sstable-test-is-script-not-framework [IN] OBSERVATION
`test_sstable.py` is a linear script using bare `assert` statements and `print` progress markers, not a pytest/unittest suite — it runs top-to-bottom in a `tempfile.TemporaryDirectory` context.
- Source: entries/2026/05/28/sstable-and-compaction-test_sstable.md

### sstable-tombstone-encoding-0xff [IN] OBSERVATION
Tombstones are encoded on disk as a single `0xFF` byte in the value position; this is unambiguous because valid value-length fields are 4 bytes big-endian and cannot start with `0xFF` at realistic value sizes.
- Source: entries/2026/05/28/sstable-and-compaction-sstable.md

### sstable-tombstone-is-null-value [IN] OBSERVATION
Tombstones are represented at the API level as `SSTableEntry` objects with `value=None`, not as a separate deletion record type.
- Source: entries/2026/05/28/sstable-and-compaction-test_sstable.md

### sstable-trailer-single-point-of-failure [IN] OBSERVATION
The SSTable's 12-byte trailer (`footer_start` offset + entry `count`) is an unprotected single point of failure; its corruption makes the entire file's sparse index and data entries unreachable with no fallback discovery mechanism
- Source: entries/2026/05/29/topic-sstable-footer-integrity.md

### sstable-write-path-has-dual-fragility [IN] OBSERVATION
The SSTable write path has two independent fragility points: sorted key order is the caller's responsibility with no enforcement (violation silently corrupts binary search), and the file header is written as a placeholder then patched via seek-back after all entries are written (a crash between data write and header patch leaves a structurally invalid file).

### sstable-writer-append-then-patch [IN] OBSERVATION
`SSTableWriter` writes a placeholder header, appends all entries and the sparse index, then seeks back to byte 0 to patch the final entry count — an unfinished SSTable has `entry_count=0` and appears empty.
- Source: entries/2026/05/28/sstable-and-compaction-sstable.md

### sstable-writer-finish-no-fsync [IN] OBSERVATION
`sstable-and-compaction/sstable.py:SSTableWriter.finish()` closes the file without calling `flush()` or `os.fsync()`, relying on implicit Python close-time buffer flush with no disk durability guarantee.
- Source: entries/2026/05/29/topic-fsync-guarantees-across-implementations.md

### sstable-writer-is-natural-filter-builder [IN] OBSERVATION
The `SSTableWriter.add()` → `finish()` lifecycle is the natural integration point for per-SSTable bloom filter construction, since keys are already iterated in sorted order during the write pass
- Source: entries/2026/05/28/topic-counting-bloom-compaction-interaction.md

### sstable-writer-no-fsync-on-close [IN] OBSERVATION
`SSTableWriter.finish()` at `sstable-and-compaction/sstable.py:91` calls `close()` without prior `fsync()`, leaving newly flushed SSTables vulnerable to loss if the OS page cache hasn't been written to disk
- Source: entries/2026/05/29/topic-bitcask-crash-recovery-guarantees.md

### sstable-writer-single-file-lifecycle [IN] OBSERVATION
`SSTableWriter` binds to one file descriptor at construction with no method to rotate to a new file mid-write, making it structurally incompatible with size-bounded output splitting
- Source: entries/2026/05/29/topic-compaction-output-splitting.md

### stage-timing-subtracts-downstream [IN] OBSERVATION
`Stage._tracked_process` subtracts time spent after each `yield` (downstream processing time) to attribute wall-clock time accurately to each individual stage
- Source: entries/2026/05/29/batch-word-count-pipeline.md

### standard-bloom-suffices-for-immutable-sstables [IN] OBSERVATION
Standard bloom filters lack deletion support, but SSTables are write-once-then-discarded, so deletion is never needed and the 8x space savings over counting filters is free
- Source: entries/2026/05/29/topic-bloom-filter-on-sstables.md

### stcs-compact-mutates-sstable-list [IN] OBSERVATION
`_stcs_compact` both removes old and appends new entries to `self._sstables` as a side effect; callers must not hold references into the list across a compaction call.
- Source: entries/2026/05/29/sstable-and-compaction-sstable-_stcs_compact.md

### stcs-merges-all-qualifying-buckets-per-call [IN] OBSERVATION
Size-tiered `run_compaction()` processes every bucket that meets `min_threshold` in a single call, while leveled compaction processes at most one level per call.
- Source: entries/2026/05/29/sstable-and-compaction-sstable-CompactionManager.md

### stcs-merges-within-size-tiers [IN] OBSERVATION
Size-tiered compaction only merges SSTables in the same size tier (determined by `_get_tier()` comparing `file_size` against `_size_thresholds`); it never merges a small SSTable directly with a much larger one.
- Source: entries/2026/05/29/sstable-and-compaction-sstable-_stcs_compact.md

### stcs-min-threshold-default-four [IN] OBSERVATION
A size tier must accumulate at least 4 SSTables (the default `_min_threshold`) before `_stcs_compact` merges them; buckets below this threshold are skipped entirely.
- Source: entries/2026/05/29/sstable-and-compaction-sstable-_stcs_compact.md

### stcs-overlapping-key-ranges [IN] OBSERVATION
Size-tiered compaction allows multiple SSTables at the same tier to contain overlapping key ranges, requiring point lookups to check all SSTables in a tier
- Source: entries/2026/05/29/topic-leveled-vs-size-tiered-compaction.md

### storage-crash-recovery-has-no-safe-path [IN] OBSERVATION
No storage engine has a fully safe crash recovery path: compaction lacks atomicity, WAL replay ignores batch boundaries, and CRC checksums leave routing metadata unprotected — corruption can enter via unprotected metadata, persist through non-validating compaction, and survive recovery via batch-unaware replay.

### storage-has-no-self-healing-at-any-layer [IN] OBSERVATION
Storage engines degrade monotonically during normal operation (leaked pages, growing height, no rebalancing) and have no safe recovery after crashes (non-atomic compaction, partial batch replay, unchecked metadata) — there is no path from degraded state back to healthy state.

### storage-operations-have-unbounded-memory-consumption [IN] OBSERVATION
Three core storage operations materialize entire datasets in memory with no streaming, pagination, or backpressure: LSM range scans load all matching entries into a dict before returning, compaction buffers all merged entries into a list before writing, and SSTable lookups rebuild the sparse index key list on every call — creating O(n) memory pressure proportional to total data size rather than result size.

### store-range-scan-passes-global-bounds [IN] OBSERVATION
`RangePartitionedStore.range_scan` passes the same global `start_key`/`end_key` to every overlapping partition rather than clamping to partition boundaries, relying on `bisect_left` to naturally exclude out-of-range keys.
- Source: entries/2026/05/29/range-partitioning-range_partitioning-range_scan.md

### strategy-enum-covers-two-of-three [IN] OBSERVATION
`ConflictStrategy` implements LWW and custom merge but not CRDTs; the CRDT module exists separately in `crdts.py` with no integration into the multi-leader replication strategy dispatch
- Source: entries/2026/05/29/topic-conflict-resolution-strategies.md

### strategy-selection-is-binary [IN] OBSERVATION
`CompactionManager` compares `strategy` with `==` against `"size_tiered"` — any other string silently selects leveled compaction with no validation or error.
- Source: entries/2026/05/29/sstable-and-compaction-sstable-CompactionManager.md

### stream-join-bounds-materialization-via-watermarks [IN] OBSERVATION
`StreamJoinProcessor` expires buffered events when they fall below `watermark - window_duration`, bounding memory at the cost of potentially dropping late-arriving matches
- Source: entries/2026/05/29/topic-ddia-chapter-10-batch-processing.md

### stream-join-buffers-bounded-by-window [IN] OBSERVATION
The processor actively expires events so buffer sizes remain proportional to `window_duration × event_rate`, not total events processed; tests confirm buffers stay under 10 per side after 1000 events with a 5-second window.
- Source: entries/2026/05/29/stream-join-processor-test_stream_join_processor.md

### stream-join-correctness-depends-on-window-alignment [IN] OBSERVATION
Stream join correctness depends on consistent alignment across three independent window mechanisms: join and aggregation windows are configured independently with no constraint preventing misalignment, expiration uses a one-sided cutoff that doesn't match the symmetric matching predicate, and buffer size is bounded only by the window duration relative to watermark advancement rate.

### stream-join-get-results-destructive [IN] OBSERVATION
`get_results()` swaps out and resets the accumulated results list; callers see only results since the previous drain (pull-based consumption model)
- Source: entries/2026/05/29/stream-join-processor-stream_join_processor.md

### stream-join-inner-requires-key-and-window-match [IN] OBSERVATION
Inner join only produces a `JoinResult` when both key equality and `|t_left - t_right| <= window_duration` hold; either condition failing alone prevents a match.
- Source: entries/2026/05/29/stream-join-processor-test_stream_join_processor.md

### stream-join-late-events-dropped-silently [IN] OBSERVATION
Events arriving after `watermark - allowed_lateness` are dropped and counted in `stats.late_events_dropped`; the processor never raises exceptions for late arrivals.
- Source: entries/2026/05/29/stream-join-processor-test_stream_join_processor.md

### stream-join-left-asymmetric [IN] OBSERVATION
`JoinType.LEFT` emits misses only for unmatched left-side events; unmatched right-side events are silently dropped without any miss emission
- Source: entries/2026/05/29/stream-join-processor-stream_join_processor.md

### stream-join-miss-deferred [IN] OBSERVATION
Outer-join misses are only emitted at event expiration time (when timestamp falls below watermark minus window duration), never eagerly on arrival
- Source: entries/2026/05/29/stream-join-processor-stream_join_processor.md

### stream-join-no-stream-validation [IN] OBSERVATION
The processor does not validate that an event's `stream_name` matches either configured stream; an unknown stream name is silently treated as the right stream via `_is_left` returning `False`
- Source: entries/2026/05/29/stream-join-processor-stream_join_processor.md

### stream-join-one-to-many [IN] OBSERVATION
A single event can match multiple events on the opposite side; the matched flag is set on all participants, so none produce outer-join misses
- Source: entries/2026/05/29/stream-join-processor-stream_join_processor.md

### stream-join-process-event-returns-immediate-matches [IN] OBSERVATION
`process_event()` returns matches synchronously on each call (push-based) rather than batching to a flush boundary; results are available the instant a match is found.
- Source: entries/2026/05/29/stream-join-processor-test_stream_join_processor.md

### stream-join-watermark-monotonic [IN] OBSERVATION
The join processor's watermark only advances forward via `max(current, event.timestamp)`; `advance_time()` silently ignores timestamps at or below the current watermark
- Source: entries/2026/05/29/stream-join-processor-stream_join_processor.md

### stream-version-is-event-count [IN] OBSERVATION
`stream_version()` returns the count of events in a stream (not the latest `event_id`), and `expected_version` checks against this count for optimistic concurrency
- Source: entries/2026/05/29/event-sourcing-store-event_store.md

### subscriber-dispatch-is-synchronous [IN] OBSERVATION
Event subscriber notification in `EventStore` blocks the `append()` caller — projections update on the writer's thread before the append call returns, trading throughput for zero-gap consistency
- Source: entries/2026/05/29/topic-live-projection-subscription-mechanism.md

### sync-all-two-rounds-suffices [IN] OBSERVATION
`CRDTReplicaGroup.sync_all` runs exactly 2 rounds of all-pairs sync, which is sufficient for convergence of any state-based CRDT with a commutative/associative/idempotent merge
- Source: entries/2026/05/29/conflict-free-replicated-data-types-crdts.md

### sync-mode-none-skips-per-write-fsync [IN] OBSERVATION
When `sync_mode="none"`, the per-write `_do_sync()` call (default `force=False`) is a complete no-op — no fsync and no batch queue — trading durability for maximum write throughput
- Source: entries/2026/05/29/topic-sync-mode-none-safety.md

### system-converges-on-permanent-dark-failure [IN] OBSERVATION
The system converges on permanent dark failure: there is no operational path to correctness at any scale AND failure is irreversible and undetectable, meaning the system silently accumulates data loss with no diagnostic signal, operational remedy, or evolutionary escape — even the decision to rebuild requires evidence the system cannot produce.

### system-degrades-monotonically-at-every-abstraction-level [IN] OBSERVATION
The system degrades monotonically at every abstraction level with no equilibrium: storage engines erode structurally with no self-healing (leaked pages, growing height, no rebalancing, unsafe recovery), and the distributed layer's compensating mechanisms (anti-entropy, read repair) cannot fully resolve the resulting replica divergence, which accumulates without bound.

### system-failure-is-undetectable-and-irreversible [IN] OBSERVATION
The system's natural trajectory is toward permanent undetectable data loss: corruption propagates silently through every pipeline without detection, AND the structural degradation that enables it is irreversible at every abstraction level — failure is invisible while it happens and unfixable after it's discovered.

### system-has-no-stable-operational-regime [IN] OBSERVATION
The system has no stable operational regime at any scale: storage degrades monotonically during normal operation with no self-healing, failures widen correctness gaps at every abstraction level, and no storage paradigm can escape by scaling (hash indexes hit memory walls, LSM hits probe walls) — the architecture converges toward failure under every operating condition.

### system-has-no-undo-at-any-layer [IN] OBSERVATION
No layer in the system can reverse a completed operation: the WAL stores only new values with no before-images (making rollback structurally impossible), and transaction abort is a status-flag change that leaves all written versions on disk, meaning neither the storage layer nor the logical transaction layer supports undo.

### system-neither-verifiable-nor-repairable [IN] OBSERVATION
The system is trapped in a verification-repair deadlock: verification is impossible at every architectural layer (integrity checks degrade along the pipeline, protocol safety is unfalsifiable under current testing) AND the rigid binary format design across the storage stack prevents adding verification or self-healing capabilities through evolution, foreclosing both detection and correction of faults.

### term-partitioned-point-query-is-targeted [IN] OBSERVATION
`TermPartitionedDB.query_by_field` looks up a single index partition by term hash then fetches documents from their home partitions, touching 1 + K partitions (K = distinct data partitions) instead of scattering to all.
- Source: entries/2026/05/29/secondary-index-partitioning-secondary_index_partitioning.md

### term-partitioned-write-touches-multiple [IN] OBSERVATION
`TermPartitionedDB.put` touches 1 + N partitions where N depends on how many distinct index partitions the document's field values hash to, because index entries live on term-hashed partitions separate from the data partition.
- Source: entries/2026/05/29/secondary-index-partitioning-secondary_index_partitioning.md

### test-block-size-differs-from-default [IN] OBSERVATION
Tests universally use `block_size=4` while `SSTableWriter` defaults to 64 and `lsm.py` defaults to 16; the small test value forces multi-block index paths with small datasets
- Source: entries/2026/05/29/topic-sstable-block-format.md

### tester-dev-suites-overlap-not-hierarchical [IN] OBSERVATION
The tester and developer test suites have overlapping but non-hierarchical coverage: tester files uniquely test spec-example compliance and cross-path equivalence, while dev files uniquely test crash recovery and internal state
- Source: entries/2026/05/29/topic-test-coverage-gaps.md

### tester-files-are-generated-artifacts [IN] OBSERVATION
The `tester_test_*.py` files are generated by the code-expert workflow's tester stage from implementation specs, distinct from hand-written `test_*.py` pytest files
- Source: entries/2026/05/29/topic-tester-runner-system.md

### tester-files-are-hand-written [IN] OBSERVATION
The `tester_test_*.py` files contain no auto-generation markers and are hand-authored standalone test suites, not generated by code-expert or any other pipeline — contradicts existing `tester-files-are-generated-artifacts`
- Source: entries/2026/05/29/topic-code-expert-generation-pipeline.md

### tester-files-are-standalone-runnable [IN] OBSERVATION
Every `tester_test_*.py` file contains an `if __name__ == "__main__":` block, making it executable via `python` without pytest
- Source: entries/2026/05/29/topic-tester-vs-test-convention.md

### tester-files-never-import-internals [IN] OBSERVATION
Tester test files import only the public API (e.g., `from btree import BTree`), while developer test files import internal types like `WAL`, `_serialize_leaf`, and `HEADER_FMT` for state injection and internal invariant checking
- Source: entries/2026/05/29/topic-test-coverage-gaps.md

### tester-files-use-stdout-validation [IN] OBSERVATION
Tester files print `"test_name PASSED"` to stdout, indicating an output-parsing runner rather than pytest's exit-code-based reporting
- Source: entries/2026/05/29/topic-tester-vs-test-convention.md

### tester-naming-avoids-pytest-discovery [IN] OBSERVATION
The `tester_test_*.py` prefix does not match pytest's default collection patterns (`test_*.py` / `*_test.py`), so these files are excluded from `pytest` runs unless explicitly specified
- Source: entries/2026/05/29/topic-tester-vs-test-convention.md

### tester-print-includes-diagnostics [IN] OBSERVATION
Tester files embed runtime metrics in their pass/fail output (tree height, page counts, pass/fail ratios) that pytest's structured reporting does not surface without the `-s` flag
- Source: entries/2026/05/29/topic-tester-vs-pytest-test-duality.md

### tester-pytest-boundary-is-leaky [IN] OBSERVATION
At least 6 `tester_test_*.py` files import pytest despite being designed for standalone execution, indicating the two suites share lineage rather than being fully independent
- Source: entries/2026/05/29/topic-tester-vs-test-convention.md

### tester-spec-compliance-is-unique [IN] OBSERVATION
`TestSpecExample` tests in tester files replay the exact spec usage example verbatim as a living executable spec, a form of spec-drift detection that developer test suites do not replicate
- Source: entries/2026/05/29/topic-test-coverage-gaps.md

### tester-test-decomposition [IN] OBSERVATION
Tester files decompose monolithic pytest tests into smaller focused functions with docstrings (e.g., `test_basic` becomes `test_basic_put_get` + `test_range_scan` + `test_persistence`)
- Source: entries/2026/05/29/topic-code-expert-generation-pipeline.md

### tester-test-dual-runner [IN] OBSERVATION
Some `tester_test_*.py` files are pytest-only (using fixtures/`pytest.raises`), while others include an `if __name__ == '__main__':` block for standalone execution; the two execution patterns coexist across modules
- Source: entries/2026/05/29/topic-tester-test-file-locations.md

### tester-test-files-not-excluded [IN] OBSERVATION
No `collect_ignore`, `norecursedirs`, `testpaths`, or `python_files` directive exists anywhere in the repo, so `tester_test_*.py` files will be collected by pytest's default `*_test.py` glob
- Source: entries/2026/05/29/topic-pytest-configuration.md

### tester-test-no-shared-harness [IN] OBSERVATION
There is no shared `conftest.py`, base class, or test harness across `tester_test_*.py` files; each imports only its own module under test plus standard library utilities and is fully self-contained
- Source: entries/2026/05/29/topic-tester-test-file-locations.md

### tester-tests-are-distilled-specs [IN] OBSERVATION
Each `tester_test_*.py` is a simplified subset of its corresponding `test_*.py`, organized by numbered behavioral properties (`# 1. No false negatives`, `# 2. FPR within 2x of target`) rather than exhaustive edge cases
- Source: entries/2026/05/29/topic-tester-generation-pipeline.md

### tester-tests-not-auto-regenerated [IN] OBSERVATION
`tester_test_*.py` files contain no auto-generation markers (`generated`, `DO NOT EDIT`) and no regeneration tooling exists in the documented code-expert workflow; they are static committed files
- Source: entries/2026/05/29/topic-tester-generation-pipeline.md

### tester-tests-public-contract [IN] OBSERVATION
`tester_test_*.py` files test the public API contract of each implementation, importing only public symbols (plus format constants for fault injection), acting as external conformance harnesses
- Source: entries/2026/05/29/log-structured-hash-table-tester_test_bitcask.md

### tester-versions-have-more-test-functions [IN] OBSERVATION
The `tester_test_*.py` files consistently contain more test functions than their `test_*.py` counterparts (bloom-filter: 11 vs 5, gossip-protocol: 12 vs 10), and include tests for features not covered in the default-discovered files
- Source: entries/2026/05/29/topic-pytest-default-collection.md

### testing-covers-neither-crash-nor-async-failure-modes [IN] OBSERVATION
The testing methodology covers neither single-node crash failures nor distributed asynchronous failures: crash recovery paths are systematically untested (no torn-write tests, no compaction crash tests), and distributed protocols are validated only under synchronous deterministic simulation with no real network I/O — the entire failure surface area from storage crashes to network partitions is invisible to the test suite.

### tob-competing-proposals-bump-rounds [IN] OBSERVATION
When a TOB slot is decided by another proposer with a different value, the losing node re-proposes its value for a new slot rather than retrying the same slot with a higher round
- Source: entries/2026/05/29/topic-multi-paxos-vs-single-decree.md

### tob-consensus-uses-paxos-ballot-numbers [IN] OBSERVATION
`ConsensusInstance.prepare(n)` / `accept(n, val)` use monotonic proposal numbers where a higher prepare preempts a lower one, causing `accept` with a stale number to return `{accepted: False}`
- Source: entries/2026/05/29/total-order-broadcast-test_total_order_broadcast.md

### tob-contiguous-delivery [IN] OBSERVATION
Messages are delivered to the application only when all prior slots are decided; `_next_slot` advances through a contiguous run, and a gap (undecided slot N when N+1 is decided) blocks all later delivery
- Source: entries/2026/05/29/total-order-broadcast-total_order_broadcast.md

### tob-full-paxos-per-slot [IN] OBSERVATION
Every slot in Total Order Broadcast runs the complete two-phase Paxos protocol (Prepare→Promise→Accept→Accepted) from scratch; no state carries between slots to skip Phase 1
- Source: entries/2026/05/29/topic-multi-paxos-vs-single-decree.md

### tob-linearizable-reads-via-broadcast [IN] OBSERVATION
`LinearizableRegister.read()` broadcasts the read through consensus rather than reading locally, establishing a consistent point in the total order — necessary for linearizability without leader leases
- Source: entries/2026/05/29/total-order-broadcast-total_order_broadcast.md

### tob-linearizable-register-is-built-on-broadcast [IN] OBSERVATION
`LinearizableRegister` wraps `TOBCluster` to provide `write()`/`read()`/`compare_and_set()`, demonstrating the DDIA Chapter 9 equivalence between total-order broadcast and linearizable storage
- Source: entries/2026/05/29/total-order-broadcast-test_total_order_broadcast.md

### tob-no-leader-concept [IN] OBSERVATION
TOBNode has no leader election or lease mechanism; any node can propose for any undecided slot at any time, making it vulnerable to competing proposals
- Source: entries/2026/05/29/topic-multi-paxos-vs-single-decree.md

### tob-no-persistent-storage [IN] OBSERVATION
TOBNode crash/recovery uses state transfer from peers via `force_decide()` on each missed slot; there is no WAL, persistent log, or on-disk state — all durability depends on a majority staying alive
- Source: entries/2026/05/29/total-order-broadcast-total_order_broadcast.md

### tob-paxos-per-slot [IN] OBSERVATION
Each slot in the total order log is decided by an independent single-decree Paxos instance (`ConsensusInstance`); there is no stable-leader optimization or Multi-Paxos Phase 1 skip
- Source: entries/2026/05/29/total-order-broadcast-total_order_broadcast.md

### tob-preempted-value-requeued [IN] OBSERVATION
When a proposer's value loses its slot to a competing value, the original is pushed back onto `_pending` for proposal in a later slot, ensuring no broadcast messages are silently dropped
- Source: entries/2026/05/29/total-order-broadcast-total_order_broadcast.md

### tob-proposal-number-encodes-node-id [IN] OBSERVATION
Proposal numbers are computed as `round * num_nodes + node_id`, guaranteeing uniqueness across nodes within the same round but restarting at round 0 for each slot independently
- Source: entries/2026/05/29/topic-multi-paxos-vs-single-decree.md

### tob-quorum-is-strict-majority [IN] OBSERVATION
The TOB cluster uses 3 nodes where 2 can make progress but 1 alone cannot, confirming a strict-majority quorum requirement for both Paxos Phase 1 and Phase 2
- Source: entries/2026/05/29/total-order-broadcast-test_total_order_broadcast.md

### tob-recovery-replays-missed-slots [IN] OBSERVATION
After `recover_node()`, the recovered node's delivery order matches live nodes' order, meaning recovery replays all consensus decisions made while the node was down via state transfer
- Source: entries/2026/05/29/total-order-broadcast-test_total_order_broadcast.md

### tob-uses-single-decree-paxos-per-slot [IN] OBSERVATION
Each slot in the total order broadcast is decided by an independent `ConsensusInstance` running single-decree Paxos with prepare/accept phases (`total_order_broadcast.py:3`).
- Source: entries/2026/05/29/topic-linearizable-reads-via-tob.md

### tob-value-adoption-on-prepare [IN] OBSERVATION
In `_handle_prepare_response`, if any acceptor reports a previously accepted value, the proposer must adopt the one with the highest proposal number — this is the core Paxos safety invariant preventing decided values from being overwritten
- Source: entries/2026/05/29/total-order-broadcast-total_order_broadcast.md

### tombstone-lifecycle-fragmented-across-modules [IN] OBSERVATION
Tombstone management is handled differently at every layer: the LSM uses an ambiguous empty-bytes sentinel indistinguishable from empty values, compaction preserves tombstones by default with caller-controlled removal, and distributed deletion requires cross-replica convergence before safe removal — with no coordination between these concerns.

### tombstone-marker-single-byte-no-redundancy [IN] OBSERVATION
The tombstone/value distinction in `sstable-and-compaction/sstable.py` rests on a single byte (`0xFF` vs first byte of a 4-byte value length), making it vulnerable to single-bit corruption that silently converts live entries to deletions or vice versa
- Source: entries/2026/05/29/topic-sstable-magic-number-vs-crc.md

### tombstone-removal-requires-full-key-coverage [IN] OBSERVATION
A tombstone for key K can only be safely removed during compaction if every SSTable that could contain an older entry for K is included in that compaction run; otherwise the deleted key can be resurrected from a surviving SSTable
- Source: entries/2026/05/28/topic-tombstone-gc-safety.md

### tombstone-removes-cross-segment [IN] OBSERVATION
A tombstone record in segment N removes the key from `_index` even if the key's live value was written in an earlier segment, because the index is a flat dict shared across all segment scans.
- Source: entries/2026/05/28/log-structured-hash-table-bitcask-_scan_segment.md

### tombstone-reported-as-none-in-conflict [IN] OBSERVATION
`ConflictRecord.remote_value` reports `None` for tombstoned changes rather than the internal `_TOMBSTONE` sentinel; callers see deletion as `None` and never observe the sentinel object.
- Source: entries/2026/05/29/multi-leader-replication-multi_leader-apply_remote_change.md

### tombstone-sentinel-is-a-forbidden-value [IN] OBSERVATION
Storing `b"__BITCASK_TOMBSTONE__"` as a legitimate value in `log-structured-hash-table` causes silent data loss on next recovery, as `_scan_segment` interprets it as a delete — the sentinel is an implicit API constraint with no validation
- Source: entries/2026/05/29/topic-tombstone-encoding.md

### topology-creates-divergence-window-not-correctness-gap [IN] OBSERVATION
Replication topology affects only the duration of observable divergence, not the final convergence outcome: all topologies use identical deterministic LWW resolution, but ring topology creates O(N) rounds of divergence while all-to-all converges in O(1) — with no adaptive topology switching, the system cannot minimize the divergence window in response to cluster size or partition conditions.

### topology-does-not-change-conflict-outcome [IN] OBSERVATION
Both topologies use the same deterministic `(timestamp, node_id)` comparison for LWW resolution; topology affects when and where conflicts are detected, not which value wins.
- Source: entries/2026/05/29/topic-ring-vs-all-to-all-propagation-delay.md

### torn-length-prefix-causes-silent-skip [IN] OBSERVATION
If a torn write corrupts the 4-byte length prefix in `wal.py:_read_record`, the reader interprets garbage as `record_length`, reads that many bytes (consuming valid subsequent records as data), then returns `None` on short read — no error is raised and no resync is attempted
- Source: entries/2026/05/29/topic-block-aligned-wal-records.md

### transaction-isolation-composes-two-invariant-layers [IN] OBSERVATION
Serializable snapshot isolation is achieved by composing two complementary invariant layers: MVCC provides visibility correctness (append-only versions, own-writes guarantee, symmetric deletion visibility) while SSI adds serializability enforcement (read-only optimization, committed-only snapshots, write-delete mutual exclusion).

### transaction-isolation-fragile-under-restart [IN] OBSERVATION
Transaction isolation's carefully composed two-invariant-layer model (MVCC visibility plus SSI conflict detection) is fragile under restart: abort is a status-change-only operation that leaves written data on disk, and MVCC counters that must be monotonic have no persistence mechanism, so a crash both resurrects aborted writes and breaks the ordering invariant that determines visibility.

### transaction-recovery-has-no-crash-path [IN] OBSERVATION
Transaction abort is a status-change-only operation with no disk rollback, and MVCC correctness depends on monotonic counters that have no persistence mechanism, meaning a crash simultaneously leaves aborted-transaction writes on disk and resets the ordering counters that determine visibility — both invariants fail together.

### transfer-dict-keyed-by-arc [IN] OBSERVATION
The transfer dict uses `(arc_start, arc_end)` tuples as keys, so two vnodes producing identical arc boundaries would silently overwrite rather than accumulate.
- Source: entries/2026/05/29/topic-transfer-map-accuracy.md

### transfer-test-weak-assertions [IN] OBSERVATION
`test_add_node_returns_transfers` only asserts transfer direction (A→B) and existence, not arc non-overlap or total size correctness — the code is correct but the test doesn't prove it.
- Source: entries/2026/05/29/topic-transfer-map-accuracy.md

### tree-height-monotonically-increases [IN] OBSERVATION
Since neither the reference implementation nor PostgreSQL merges internal nodes on delete, tree height only grows (on root splits) and never shrinks, even under heavy deletion
- Source: entries/2026/05/28/topic-postgres-nbtree-lazy-deletion.md

### truncate-advances-base-offset-additively [IN] OBSERVATION
`Topic.truncate` updates `_base_offsets[partition]` with `+= actual`, a relative shift, because it only removes a contiguous prefix from the front of the partition log.
- Source: entries/2026/05/29/topic-log-compaction-vs-retention.md

### truncate-plus-per-record-crc-is-dangerous-combination [IN] OBSERVATION
The non-atomic `truncate()` in `wal.py` can produce reordered or incomplete files that pass per-record CRC validation; chained checksums would detect the corruption at the chain break point
- Source: entries/2026/05/29/topic-cumulative-checksums.md

### tumbling-window-aligned-to-zero [IN] OBSERVATION
`TumblingWindowAggregator` aligns window boundaries to multiples of the window size starting from 0 — windows are `[0, size)`, `[size, 2*size)`, etc., not relative to the first event's timestamp.
- Source: entries/2026/05/29/stream-join-processor-test_stream_join_processor.md

### tumbling-window-floor-division [IN] OBSERVATION
`TumblingWindowAggregator` assigns windows by floor-dividing the timestamp by window size, producing aligned non-overlapping boundaries regardless of when events arrive
- Source: entries/2026/05/29/stream-join-processor-stream_join_processor.md

### two-sstable-implementations-same-pattern [IN] OBSERVATION
`lsm.py` (using `sparse_index_interval=16` and `bisect`) and `sstable.py` (using `block_size=64` and manual binary search) implement the same sparse-index-with-block-scan pattern with different defaults, naming, and file format maturity
- Source: entries/2026/05/29/topic-sstable-block-format.md

### two-wal-designs-in-repo [IN] OBSERVATION
The standalone `wal.py` uses a logical WAL (keyed PUT/DELETE operations with COMMIT markers) while `btree.py` uses a physical WAL (raw page images); they solve different problems and have different recovery semantics
- Source: entries/2026/05/29/topic-write-ahead-logging-fsync-ordering.md

### unbundled-catchup-rebuild-equivalence [IN] OBSERVATION
Catch-up via `snapshot_and_stream` and rebuild via full CDC event replay must produce identical derived-system state, verified by comparing `get_state()` output from two independent derived systems
- Source: entries/2026/05/29/unbundled-database-test_tester_validation.md

### unbundled-db-catch-up-replays-history [IN] OBSERVATION
A derived system added with `catch_up=True` receives all historical CDC events to reach current state, without requiring a separate snapshot mechanism.
- Source: entries/2026/05/29/unbundled-database-test_unbundled_database.md

### unbundled-db-cdc-events-carry-old-value [IN] OBSERVATION
Every update and delete CDC event includes the previous value (`old_value`), enabling derived systems to undo prior state; inserts have `old_value=None`.
- Source: entries/2026/05/29/unbundled-database-tester_test_unbundled_database.md

### unbundled-db-composes-via-log [IN] OBSERVATION
The unbundled database wires independent subsystems (WAL, storage engine, CDC, derived systems) through a shared append-only log with independent consumer positions, applying Unix-style composition at the system architecture level
- Source: entries/2026/05/29/topic-ddia-chapter-10-batch-processing.md

### unbundled-db-flush-required-for-derived [IN] OBSERVATION
Derived systems (secondary indexes, materialized views, full-text search) only see mutations after `db.flush()` is called; writes go to WAL and storage immediately but CDC consumers are decoupled.
- Source: entries/2026/05/29/unbundled-database-test_unbundled_database.md

### unbundled-db-is-log-first-cdc-is-state-first [IN] OBSERVATION
The unbundled database writes to WAL before storage engine, while the CDC module writes to in-memory rows before appending the log — opposite ordering of the source-of-truth relationship
- Source: entries/2026/05/29/topic-cdc-flush-semantics.md

### unbundled-db-put-returns-cdc-event [IN] OBSERVATION
`UnbundledDatabase.put()` and `delete()` return `CDCEvent` objects directly, making CDC a synchronous, first-class part of the write API rather than a side-channel.
- Source: entries/2026/05/29/unbundled-database-test_unbundled_database.md

### unbundled-db-rebuild-equals-live [IN] OBSERVATION
`rebuild_system()` must produce state identical to incremental live processing — this is a tested invariant verified by capturing `get_state()` before and after rebuild.
- Source: entries/2026/05/29/unbundled-database-test_unbundled_database.md

### unbundled-db-wal-lsn-starts-at-one [IN] OBSERVATION
The unbundled database WAL assigns LSNs starting from 1 (not 0), with `latest_lsn == 0` indicating an empty log.
- Source: entries/2026/05/29/unbundled-database-tester_test_unbundled_database.md

### unbundled-db-wal-persistence-jsonl [IN] OBSERVATION
The unbundled database's WAL supports optional file-backed persistence via a `.jsonl` file path, and a new `WriteAheadLog` instance recovers entries from that file on init.
- Source: entries/2026/05/29/unbundled-database-test_unbundled_database.md

### unbundled-flush-zeroes-lag [IN] OBSERVATION
After `db.flush()`, `get_lag()` returns 0 for all derived systems, confirming all pending CDC events have been consumed
- Source: entries/2026/05/29/unbundled-database-test_tester_validation.md

### unbundled-lsn-sequential [IN] OBSERVATION
LSNs returned by the unbundled database's `put()`/`delete()` are 1-indexed and strictly sequential with no gaps
- Source: entries/2026/05/29/unbundled-database-test_tester_validation.md

### unbundled-rebuild-clears-state [IN] OBSERVATION
`StorageEngine.rebuild(wal)` clears all existing data (including manually injected entries) before replaying WAL entries, ensuring no phantom or stale state survives a rebuild
- Source: entries/2026/05/29/unbundled-database-test_tester_validation.md

### unbundled-wal-entries-ordered-by-lsn [IN] OBSERVATION
WAL entries in the unbundled database's `_entries` list are always in ascending LSN order, maintained by the sequential nature of `append()`
- Source: entries/2026/05/29/unbundled-database-unbundled_database-truncate_before.md

### unbundled-wal-truncate-keeps-gte [IN] OBSERVATION
The unbundled database WAL cutoff is exclusive-below: entries with `lsn >= cutoff` are retained, entries with `lsn < cutoff` are discarded
- Source: entries/2026/05/29/unbundled-database-unbundled_database-truncate_before.md

### unbundled-wal-truncate-memory-only [IN] OBSERVATION
`WriteAheadLog.truncate_before` in the unbundled database removes entries from memory but does not modify the on-disk WAL file, so reloading from `persist_path` restores truncated entries
- Source: entries/2026/05/29/unbundled-database-unbundled_database-truncate_before.md

### unbundled-wal-truncate-preserves-lsn [IN] OBSERVATION
Truncation in the unbundled database WAL never resets `_next_lsn`, so appending after truncation continues with monotonically increasing LSNs
- Source: entries/2026/05/29/unbundled-database-unbundled_database-truncate_before.md

### unfenced-server-accepts-all [IN] OBSERVATION
`UnfencedResourceServer` has no token parameter on `write()` and accepts all writes unconditionally, serving as the pedagogical unsafe baseline for comparison with `FencedResourceServer`
- Source: entries/2026/05/29/fencing-tokens-fencing_tokens.md

### unhandled-event-types-silently-skipped [IN] OBSERVATION
Events whose `event_type` has no entry in the handlers dict are skipped without warning in both `reconstruct_state` and `Projection.catch_up`
- Source: entries/2026/05/29/event-sourcing-store-event_store-reconstruct_state.md

### utf8-decode-unguarded [IN] OBSERVATION
If a segment contains a key with invalid UTF-8 bytes, `_scan_segment` raises an unhandled `UnicodeDecodeError` that aborts recovery for all subsequent segments.
- Source: entries/2026/05/28/log-structured-hash-table-bitcask-_scan_segment.md

### values-must-be-bytes [IN] OBSERVATION
B-tree values are written as raw bytes with no encoding; keys accept `str` (auto-encoded to UTF-8) or `bytes`, but values must already be `bytes`
- Source: entries/2026/05/28/b-tree-storage-engine-btree-_serialize_leaf.md

### vc-prune-is-lossy [IN] OBSERVATION
Pruning permanently discards causal information; subsequent `compare()` or `dominates()` calls treat pruned nodes as counter=0, which can produce false concurrency or false dominance
- Source: entries/2026/05/29/vector-clocks-vector_clock-prune.md

### vc-prune-keeps-highest-counters [IN] OBSERVATION
`VectorClock.prune(n)` retains the `n` entries with the highest counter values and discards the rest, using `sorted` descending by counter
- Source: entries/2026/05/29/vector-clocks-vector_clock-prune.md

### vc-prune-tiebreak-unstable [IN] OBSERVATION
When multiple nodes share the same counter value during `prune`, which survive depends on Python's `sorted` stability over dict iteration order — not a deterministic tiebreak policy callers can rely on
- Source: entries/2026/05/29/vector-clocks-vector_clock-prune.md

### vector-clock-compare-partial-order [IN] OBSERVATION
`VectorClock.compare` returns one of four string values — `BEFORE`, `AFTER`, `EQUAL`, or `CONCURRENT` — implementing a partial order where the symmetric property holds (`BEFORE` ↔ `AFTER`)
- Source: entries/2026/05/29/vector-clocks-test_vector_clock.md

### vector-clock-immutability [IN] OBSERVATION
`VectorClock` is immutable: `increment`, `merge`, and `prune` all return new instances and never modify `_clock` in place.
- Source: entries/2026/05/29/vector-clocks-vector_clock.md

### verification-impossible-at-every-layer [IN] OBSERVATION
Verification of system correctness is impossible at every layer of the architecture: data integrity verification degrades from partial to absent along the storage pipeline (WAL CRCs exclude metadata, SSTables have no checksums at all), while protocol safety claims are unfalsifiable because all distributed testing uses synchronous simulation that cannot exercise the asynchronous failure modes the protocols are designed to tolerate.

### verify-proof-direction-unvalidated [IN] OBSERVATION
The `direction` field in proof siblings is not validated; any value other than `"left"` is silently treated as `"right"` via the else branch.
- Source: entries/2026/05/29/merkle-tree-merkle_tree-verify_proof.md

### verify-proof-hash-consistency [IN] OBSERVATION
`verify_proof` concatenates hex-encoded hash strings and encodes to bytes before hashing, matching the exact same scheme used in `__init__` to build internal nodes — a mismatch would silently break all proof verification.
- Source: entries/2026/05/29/merkle-tree-merkle_tree-verify_proof.md

### verify-proof-is-pure-static [IN] OBSERVATION
`verify_proof` is a `@staticmethod` with no side effects; it requires no tree instance, only the data and a `MerkleProof`, so it can run on a different machine than the one that built the tree.
- Source: entries/2026/05/29/merkle-tree-merkle_tree-verify_proof.md

### verify-proof-root-trust [IN] OBSERVATION
`verify_proof` confirms data matches a given root hash but does not authenticate the root itself; callers must obtain a trusted root through an independent channel (e.g., a signed block header).
- Source: entries/2026/05/29/merkle-tree-merkle_tree-verify_proof.md

### version-scan-includes-unavailable [IN] OBSERVATION
`put()` determines the next version by scanning all replicas including those marked unavailable, which prevents version regression but couples version assignment to unavailable node state.
- Source: entries/2026/05/29/read-repair-read_repair.md

### versioned-store-sibling-semantics [IN] OBSERVATION
`VersionedKVStore.get` returns a list of `VersionedValue` entries; concurrent (causally unrelated) writes produce multiple siblings, and the store never auto-resolves conflicts — clients must call `reconcile`
- Source: entries/2026/05/29/vector-clocks-test_vector_clock.md

### versioned-value-no-tombstone-flag [IN] OBSERVATION
The `VersionedValue` dataclass (`dynamo.py:14-18`) carries only `value`, `version`, and `node_id` with no field to distinguish a live value from a deletion marker, making correct distributed deletes impossible without schema changes
- Source: entries/2026/05/29/topic-leaderless-deletion-gap.md

### view-change-contagion [IN] OBSERVATION
Receiving a VIEW_CHANGE from another node causes a non-primary node to broadcast its own VIEW_CHANGE if it hasn't already, propagating the vote through the cluster without requiring all nodes to independently detect the faulty primary.
- Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-_handle_view_change.md

### view-change-requires-2f-plus-1 [IN] OBSERVATION
The new primary only acts on a view change after collecting at least 2f+1 VIEW_CHANGE messages, matching the standard PBFT quorum requirement for liveness.
- Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-_handle_view_change.md

### view-never-decreases [IN] OBSERVATION
`_handle_view_change` silently drops any message with `msg.view <= self.current_view`, enforcing monotonically non-decreasing view progression across all nodes.
- Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-_handle_view_change.md

### visibility-is-pure [IN] OBSERVATION
`_is_visible` is a pure function with no side effects; it reads `_committed`, `_aborted`, `tx.active_at_start`, and version fields but never mutates any state.
- Source: entries/2026/05/29/snapshot-isolation-mvcc_database-_is_visible.md

### visibility-requires-three-conditions [IN] OBSERVATION
A version is visible to a transaction only if its creator committed, was not in the reader's `active_at_start` set, and has a lower `tx_id` — all three must hold
- Source: entries/2026/05/29/topic-mvcc-snapshot-isolation.md

### vnode-150-guarantees-sub-1.5-imbalance [IN] OBSERVATION
With 3 equally-weighted nodes and 150 vnodes each, `load_imbalance()` is asserted to stay below 1.5 (`test_consistent_hashing.py:49`).
- Source: entries/2026/05/29/topic-virtual-node-count-tuning.md

### wal-all-mutations-under-lock [IN] OBSERVATION
Every WAL method that writes records or modifies files (`append`, `append_batch`, `checkpoint`, `truncate`) acquires `self._lock` before performing any I/O
- Source: entries/2026/05/29/topic-wal-size-check-toctou.md

### wal-append-mode-no-overwrite [IN] OBSERVATION
`_open_latest` opens WAL files in `"ab"` (append-binary) mode, making it impossible for post-crash writes to overwrite pre-crash records; this is the bridge between recovery and forward progress
- Source: entries/2026/05/29/topic-wal-crash-recovery-semantics.md

### wal-append-no-validation [IN] OBSERVATION
`append` performs no validation of `op_type` against `OP_BYTES`; invalid operation names raise `KeyError` from the dictionary lookup, not a descriptive error.
- Source: entries/2026/05/29/write-ahead-log-wal-append.md

### wal-append-not-transactional [IN] OBSERVATION
Individual `append` calls are not wrapped in any transaction boundary; for atomic multi-operation writes, `append_batch` must be used instead.
- Source: entries/2026/05/29/write-ahead-log-wal-append.md

### wal-batch-adds-commit-record [IN] OBSERVATION
`append_batch` of N items writes N+1 records to the WAL: the N data records plus a trailing record with `op_type == "COMMIT"` that seals the batch
- Source: entries/2026/05/28/write-ahead-log-test_wal.md

### wal-batch-always-fsyncs [IN] OBSERVATION
`append_batch()` always force-fsyncs regardless of the configured sync mode, because batch atomicity requires the COMMIT marker to be durable before returning.
- Source: entries/2026/05/28/write-ahead-log-wal.md

### wal-batch-atomicity-via-commit-record [IN] OBSERVATION
A batch write is only considered complete if its trailing COMMIT record is present and passes CRC; incomplete batches (missing commit) are discarded during replay
- Source: entries/2026/05/28/topic-lsm-crash-recovery-wal-format.md

### wal-batch-bug-is-performance-not-correctness [IN] OBSERVATION
The stale `_write_count` after forced syncs causes extra fsyncs (premature batch threshold triggers) but never causes data loss — forced syncs always flush to disk regardless of counter state.
- Source: entries/2026/05/29/topic-write-count-reset-semantics.md

### wal-batch-commit-sentinel [IN] OBSERVATION
Every `append_batch` call writes exactly one COMMIT record (op code 3) as the final record in the batch, with empty key and value fields, consuming one additional sequence number.
- Source: entries/2026/05/28/write-ahead-log-wal-append_batch.md

### wal-batch-mode-loses-up-to-n-records [IN] OBSERVATION
In WAL batch sync mode, up to `batch_sync_count - 1` records (default 99) can be lost on `kill -9` because `os.fsync()` is deferred until the batch threshold is reached; only `append_batch()` force-fsyncs regardless of mode.
- Source: entries/2026/05/29/topic-crash-recovery-invariants.md

### wal-batch-relies-on-practical-atomicity [IN] OBSERVATION
`append_batch()` buffers all operations plus COMMIT into a single `write()` call to minimize the partial-write window, but the code acknowledges this is not a true atomicity guarantee — a crash during the write can persist a prefix without the COMMIT marker
- Source: entries/2026/05/28/topic-wal-atomicity-guarantees.md

### wal-batch-single-file [IN] OBSERVATION
`append_batch` writes the entire batch buffer in one `_fd.write` call before `_maybe_rotate`, so a batch never spans two WAL files under normal operation
- Source: entries/2026/05/29/topic-wal-recovery-semantics.md

### wal-batch-single-write [IN] OBSERVATION
`append_batch()` serializes all data records plus the trailing COMMIT marker into one `bytearray` and issues a single `fd.write()` call, relying on OS write atomicity for small batches
- Source: entries/2026/05/28/topic-wal-commit-semantics.md

### wal-batch-sync-count-controls-durability-window [IN] OBSERVATION
In batch mode, up to `batch_sync_count - 1` records may be lost on crash because they haven't been fsynced yet.
- Source: entries/2026/05/29/write-ahead-log-wal-_do_sync.md

### wal-batch-sync-force-override [IN] OBSERVATION
The WAL's `_do_sync` batch mode (fsync every N writes) is overridden by `force=True` in `append_batch` and `checkpoint`, ensuring atomic batch boundaries and checkpoint records are always fsynced regardless of configured sync mode.
- Source: entries/2026/05/29/topic-bitcask-durability-tradeoffs.md

### wal-binary-format-prevents-evolution-and-recovery [IN] OBSERVATION
The WAL binary format is simultaneously inflexible and fragile: contiguous record packing with no block alignment prevents resync after mid-file corruption, signed 32-bit length fields theoretically admit negative values with no guard, and there is no version field — a format change invalidates all existing WAL files with no migration path and no way to distinguish old-format from new-format records.

### wal-checkpoint-consumes-sequence [IN] OBSERVATION
`checkpoint()` increments the WAL sequence counter by 1, occupying a position in the sequence number space alongside data records (e.g., after 6 data records at seq 1-6, `current_seq_num()` is 7 and `checkpoint()` returns 8).
- Source: entries/2026/05/28/write-ahead-log-tester_test_wal.md

### wal-checkpoint-forces-fsync [IN] OBSERVATION
`checkpoint()` calls `_do_sync(force=True)`, making the checkpoint marker durable on disk regardless of the configured sync mode, so recovery boundaries are never lost even in `"none"` or `"batch"` mode.
- Source: entries/2026/05/29/topic-fsync-durability-tradeoffs.md

### wal-checkpoint-forces-sync [IN] OBSERVATION
`checkpoint()` calls `_do_sync(force=True)`, making checkpoint markers always durable on disk regardless of the configured sync mode
- Source: entries/2026/05/29/topic-bitcask-crash-recovery-semantics.md

### wal-checkpoint-returns-seq [IN] OBSERVATION
`WriteAheadLog.checkpoint()` returns the sequence number it wrote, providing the caller the exact truncation boundary for the WAL-to-data-store coordination protocol
- Source: entries/2026/05/29/topic-log-structured-checkpoint-coordination.md

### wal-checkpoint-seq-is-next-after-data [IN] OBSERVATION
Checkpoint markers consume a sequence number in the same monotonic space as data records — a checkpoint after 7 data records occupies seq=8
- Source: entries/2026/05/28/write-ahead-log-test_wal.md

### wal-commit-clears-all-entries [IN] OBSERVATION
`WAL.commit` clears the entire WAL unconditionally via truncate; there is no partial commit or transaction grouping within a single WAL file
- Source: entries/2026/05/28/b-tree-storage-engine-btree-commit.md

### wal-commit-sync-before-truncate [IN] OBSERVATION
`WAL.commit` always fsyncs the data file before truncating the WAL file; reversing this order would create a crash-safety hole where committed data could be lost
- Source: entries/2026/05/28/b-tree-storage-engine-btree-commit.md

### wal-commit-syncs-metadata-implicitly [IN] OBSERVATION
`WAL.commit()` in the B-tree engine syncs metadata page 0 because `PageManager.sync()` fsyncs the single shared file descriptor that holds both data pages and metadata; there is no dedicated metadata fsync step
- Source: entries/2026/05/29/topic-wal-commit-vs-metadata-sync-ordering.md

### wal-concurrent-write-during-read-can-trigger-corruption-stop [IN] OBSERVATION
A concurrent `append()` writing to a file the iterator is reading can produce a partial record that `_read_record` interprets as CRC failure, terminating the entire iterator and dropping all remaining valid records across subsequent files
- Source: entries/2026/05/29/topic-wal-concurrency-safety.md

### wal-contiguous-no-block-alignment [IN] OBSERVATION
None of the WAL implementations use block-aligned or page-aligned record layouts; all records are packed contiguously with variable-length encoding, so torn writes can land at arbitrary byte offsets
- Source: entries/2026/05/29/topic-block-aligned-wal-records.md

### wal-corruption-returns-valid-prefix [IN] OBSERVATION
When the WAL encounters a corrupted record during replay, it stops and returns all structurally valid records preceding the corruption point rather than raising an exception to the caller.
- Source: entries/2026/05/28/write-ahead-log-tester_test_wal.md

### wal-corruption-stops-per-file [IN] OBSERVATION
`_recover_seq_num` stops at the first corrupted record within each file but continues to subsequent WAL files, so cross-file recovery is resilient but intra-file records after corruption are silently lost
- Source: entries/2026/05/29/topic-wal-recovery-semantics.md

### wal-crc-does-not-cover-seqnum [IN] OBSERVATION
The CRC32 checksum covers only `op_type + key + value`; a corrupted sequence number would not be detected by integrity checking.
- Source: entries/2026/05/28/write-ahead-log-wal.md

### wal-crc-includes-op-type [IN] OBSERVATION
The standalone WAL uniquely includes `op_type_byte` in its CRC input alongside key and value, protecting against silent operation-type corruption (e.g., PUT flipped to DELETE) that the other CRC implementations leave undetected
- Source: entries/2026/05/29/topic-crc-scope-comparison-across-implementations.md

### wal-crc-mismatch-halts-all-replay [IN] OBSERVATION
`_read_all_records` catches CRC `ValueError` and stops iteration entirely (returns, not continues), so corruption in any file aborts reading of all subsequent files too.
- Source: entries/2026/05/28/write-ahead-log-wal.md

### wal-default-segment-10mb [IN] OBSERVATION
WAL and hash-index Bitcask both default `max_file_size` to 10 MB, while the log-structured Bitcask defaults to 1 MB — reflecting the latter's heavier reliance on compaction with hint files to offset recovery cost.
- Source: entries/2026/05/29/topic-wal-segment-sizing-tradeoffs.md

### wal-default-sync-mode-is-sync [IN] OBSERVATION
`WriteAheadLog.__init__` defaults `sync_mode` to `"sync"`, making per-write fsync the safe default that callers must explicitly opt out of for higher throughput
- Source: entries/2026/05/29/topic-sync-mode-none-safety.md

### wal-do-sync-branches-mutually-exclusive [IN] OBSERVATION
The `_do_sync` method's `if`/`elif` structure means forced syncs in batch mode skip all counter logic entirely — the counter is neither incremented nor reset, which is the structural root cause of the premature batch-sync bug.
- Source: entries/2026/05/29/topic-write-count-reset-semantics.md

### wal-do-sync-requires-caller-lock [IN] OBSERVATION
`_do_sync` does not acquire `self._lock`; thread safety depends on every caller holding the lock before invoking it.
- Source: entries/2026/05/29/write-ahead-log-wal-_do_sync.md

### wal-docstring-describes-intent-not-behavior [IN] OBSERVATION
The `replay()` docstring claims it "skips uncommitted batches" (describing the DDIA concept), but the inline comments and implementation show it returns all PUT/DELETE records regardless of COMMIT presence; the docstring is aspirational, the comments are accurate
- Source: entries/2026/05/29/topic-wal-batch-atomicity-gap.md

### wal-durability-tiers-lost-during-recovery [IN] OBSERVATION
The WAL's carefully calibrated write-time durability tiers (per-write fsync in sync mode vs. batch-only fsync) are completely invisible to recovery: sequence numbers and checkpoints that could distinguish between definitely-durable and potentially-lost records are never consulted, so replay treats all CRC-valid records identically regardless of whether they were fsynced — the write-time investment in durability is wasted.

### wal-encode-is-pure [IN] OBSERVATION
`_encode_record` is a pure function that performs no I/O or state mutation; all disk writes (and fsync) are the responsibility of callers (`append`, `append_batch`, `checkpoint`, `truncate`).
- Source: entries/2026/05/28/write-ahead-log-wal-_encode_record.md

### wal-eof-advances-to-next-file [IN] OBSERVATION
An EOF or partial read within a single WAL file causes `_read_all_records` to advance to the next file rather than terminate the entire iterator
- Source: entries/2026/05/29/write-ahead-log-wal-_read_all_records.md

### wal-exactly-at-limit-rotates [IN] OBSERVATION
A WAL file whose size equals `_max_file_size` triggers rotation in `_open_latest`; only files strictly smaller are reused.
- Source: entries/2026/05/29/write-ahead-log-wal-_open_latest.md

### wal-file-size-managed-without-syscall-overhead [IN] OBSERVATION
WAL file size management avoids filesystem syscall overhead through tell()-based tracking with soft-limit semantics: the hot path uses fd.tell() instead of os.path.getsize() to avoid stat calls and TOCTOU races, the size limit is a soft cap checked before the next write rather than mid-write, and files at exactly the limit trigger rotation to prevent unbounded growth while allowing single-record overshoot.

### wal-files-sorted-lexicographic [IN] OBSERVATION
`_wal_files()` returns segment files sorted by filename (`{N:06d}.wal`), which equals chronological order since segment numbers increment monotonically
- Source: entries/2026/05/28/topic-wal-segment-deletion-ordering.md

### wal-flush-truncate-replay-safety [IN] OBSERVATION
If a crash occurs between SSTable flush and WAL truncation, replay re-inserts already-persisted entries into the memtable, which is safe because memtable values shadow SSTable values during reads (last-writer-wins)
- Source: entries/2026/05/29/log-structured-merge-tree-lsm-replay.md

### wal-format-change-breaks-compatibility [IN] OBSERVATION
Changing the CRC input in `_encode_record` invalidates all existing WAL files; a production deployment would require a version byte and dual-path CRC verification during migration
- Source: entries/2026/05/29/topic-seq-num-integrity-gap.md

### wal-fsync-per-entry [IN] OBSERVATION
Each `log_write` call forces an `os.fsync`, guaranteeing per-entry durability at the cost of one sync syscall per logged page write.
- Source: entries/2026/05/29/b-tree-storage-engine-btree-log_write.md

### wal-has-no-before-images [IN] OBSERVATION
WALRecord stores only `key` and `value` (the new value) with no `old_value` field, making undo structurally impossible from the log alone
- Source: entries/2026/05/29/topic-undo-logging-and-steal-policy.md

### wal-has-threading-lock [IN] OBSERVATION
`WriteAheadLog` in `wal.py:80` uses a `threading.Lock` to serialize `append`, `append_batch`, `checkpoint`, and `truncate`, making it the only concurrency-aware component in the codebase
- Source: entries/2026/05/29/topic-btree-concurrency-control.md

### wal-has-truncation-raft-does-not [IN] OBSERVATION
The WAL module provides explicit `truncate(up_to_seq)` and file rotation for log lifecycle management; the Raft module has no equivalent despite both being append-only log abstractions
- Source: entries/2026/05/29/topic-raft-log-compaction.md

### wal-hot-path-avoids-getsize [IN] OBSERVATION
`_maybe_rotate` uses `self._fd.tell()` instead of `os.path.getsize()`, avoiding filesystem stat calls and TOCTOU races on every append
- Source: entries/2026/05/29/topic-wal-size-check-toctou.md

### wal-is-redo-only [IN] OBSERVATION
The B-tree WAL uses redo-only recovery (replay logged page images forward); there is no undo log, so incomplete pre-commit writes in the data file are overwritten by WAL replay to restore consistency
- Source: entries/2026/05/28/b-tree-storage-engine-btree-commit.md

### wal-is-source-of-truth [IN] OBSERVATION
The unbundled database's `StorageEngine` is fully derivable from the `WriteAheadLog`; calling `rebuild()` replays the entire WAL and reproduces identical state, making the log the authoritative record
- Source: entries/2026/05/29/topic-ddia-ch12-unbundling.md

### wal-iterate-preserves-all-ops [IN] OBSERVATION
`iterate()` yields every record including COMMIT and CHECKPOINT markers, providing the raw stream needed for commit-aware recovery logic built on top of `replay()`
- Source: entries/2026/05/28/topic-wal-commit-semantics.md

### wal-key-value-length-signed-int32 [IN] OBSERVATION
Key and value lengths in the WAL binary format are packed as signed 32-bit integers (`struct` format `<i`), limiting each field to ~2 GB and leaving negative lengths unguarded.
- Source: entries/2026/05/28/write-ahead-log-wal-_encode_record.md

### wal-little-endian-hardcoded [IN] OBSERVATION
The WAL binary record format uses little-endian byte order unconditionally (struct prefix `<`), making log files non-portable across architectures with different endianness.
- Source: entries/2026/05/28/write-ahead-log-wal-_encode_record.md

### wal-max-file-size-is-soft [IN] OBSERVATION
`max_file_size` is a soft limit; a single batch can push a WAL file arbitrarily past it because `_maybe_rotate` runs only after the write and sync complete
- Source: entries/2026/05/29/topic-batch-atomicity-across-rotation.md

### wal-max-file-size-is-soft-limit [IN] OBSERVATION
`_max_file_size` is a soft cap: the check triggers rotation for the *next* write, it does not prevent the current write from exceeding the limit.
- Source: entries/2026/05/29/write-ahead-log-wal-_maybe_rotate.md

### wal-module-uses-locking-lsm-does-not [IN] OBSERVATION
The WAL module in `write-ahead-log/wal.py` uses `self._lock` for thread safety across its mutation paths, while the LSM tree module has zero locking or concurrency control despite having the same concurrent-access risks
- Source: entries/2026/05/29/topic-superversion-refcount-implementation.md

### wal-no-begin-marker [IN] OBSERVATION
The WAL protocol has no BEGIN record type, making it impossible during replay to distinguish a standalone PUT from the first record of a multi-record batch — the structural root cause of why COMMIT markers cannot enforce atomicity.
- Source: entries/2026/05/29/topic-incomplete-batch-recovery.md

### wal-no-commit-method [IN] OBSERVATION
The standalone `WriteAheadLog` class in `write-ahead-log/wal.py` has no `commit` method; sync-then-truncate coordination must be implemented by a higher-level storage engine that composes the WAL with a data file
- Source: entries/2026/05/29/topic-wal-commit-protocol.md

### wal-no-concurrent-group-commit [IN] OBSERVATION
The WAL uses a single `threading.Lock` and a write counter for batch sync rather than a concurrent waiter queue, so it cannot amortize fsync cost across concurrent callers the way PostgreSQL's group commit does.
- Source: entries/2026/05/29/topic-group-commit-optimization.md

### wal-no-directory-fsync [IN] OBSERVATION
The WAL implementation never fsyncs the parent directory after segment creation or deletion, meaning file metadata changes (including unlinks) are not guaranteed durable on crash on Linux
- Source: entries/2026/05/28/topic-wal-segment-deletion-ordering.md

### wal-no-gap-detection [IN] OBSERVATION
Neither the WAL reader nor any visible consumer checks for sequence number gaps after replay, making silent data loss from mid-file corruption undetectable
- Source: entries/2026/05/29/topic-wal-recovery-semantics.md

### wal-no-nested-batches [IN] OBSERVATION
The WAL format has no batch-start marker and no nesting support; the lock held during `append_batch` prevents interleaving, making nested or concurrent batches structurally impossible
- Source: entries/2026/05/29/topic-commit-aware-replay-design.md

### wal-no-resync-after-corruption [IN] OBSERVATION
A corrupted `record_length` in the WAL causes `_read_record` to misframe all subsequent records; there is no magic-byte or scan-forward recovery mechanism to re-synchronize
- Source: entries/2026/05/29/topic-wal-record-format-evolution.md

### wal-no-rollback-on-io-failure [IN] OBSERVATION
If `write()` or `fsync()` fails mid-batch in `append_batch`, sequence numbers are already incremented with no rollback mechanism, creating a permanent gap in the sequence space.
- Source: entries/2026/05/28/write-ahead-log-wal-append_batch.md

### wal-no-torn-write-tests [IN] OBSERVATION
The WAL module has no tests for truncated records, CRC mismatches, or partial writes; the B-tree module tests CRC corruption explicitly but the WAL does not
- Source: entries/2026/05/29/topic-partial-write-detection.md

### wal-not-imported-outside-tests [IN] OBSERVATION
No production module imports the standalone `WriteAheadLog`; it is only referenced in `test_wal.py` and `tester_test_wal.py`, meaning crash-safe commit coordination using this module is unimplemented
- Source: entries/2026/05/29/topic-wal-commit-protocol.md

### wal-open-latest-init-only [IN] OBSERVATION
`_open_latest` is called exclusively from `__init__`, so the TOCTOU window between `_wal_files()` and `os.path.getsize()` cannot be hit by concurrent WAL operations
- Source: entries/2026/05/29/topic-wal-size-check-toctou.md

### wal-partial-read-is-eof [IN] OBSERVATION
`_read_record` returns `None` on partial/short reads (torn writes), treated as EOF; this is distinct from CRC mismatch which raises `ValueError` — the distinction separates "crash during write" from "data corruption."
- Source: entries/2026/05/28/write-ahead-log-wal.md

### wal-partial-truncate-idempotent-replay [IN] OBSERVATION
If the standalone WAL's `truncate()` crashes mid-iteration over files, un-processed files retain old records that will be replayed on recovery; this is safe because PUT and DELETE are idempotent against an already-current store
- Source: entries/2026/05/29/topic-wal-crash-safety-gap.md

### wal-protects-data-not-metadata [IN] OBSERVATION
The WAL protects user key-value writes but does not record SSTable-level state transitions (flush, compaction, level assignment), leaving metadata changes unprotected across crashes
- Source: entries/2026/05/28/topic-manifest-based-compaction.md

### wal-provides-crash-safety-not-concurrency [IN] OBSERVATION
The WAL syncs with `os.fsync` for durability guarantees only; it has no role in coordinating concurrent access, consistent with the single-threaded design
- Source: entries/2026/05/28/topic-concurrent-btree-deletion.md

### wal-read-record-stateless [IN] OBSERVATION
`_read_record` is a module-level function with no dependency on `WriteAheadLog` instance state, enabling its use during WAL construction/recovery before the object is fully initialized.
- Source: entries/2026/05/28/write-ahead-log-wal-_read_record.md

### wal-record-format-length-prefixed [IN] OBSERVATION
Each WAL record on disk is prefixed with a 4-byte little-endian uint32 length covering everything after itself, allowing the reader to atomically skip or validate entire records.
- Source: entries/2026/05/28/write-ahead-log-wal-_read_record.md

### wal-record-length-excludes-own-prefix [IN] OBSERVATION
The `record_length` field counts 21 + len(key) + len(value) bytes — everything after the 4-byte length prefix itself — so total on-disk size per record is `record_length + 4`.
- Source: entries/2026/05/28/write-ahead-log-wal-_encode_record.md

### wal-record-length-prefixed [IN] OBSERVATION
Every WAL record is prefixed with a 4-byte little-endian length covering all subsequent fields (CRC through value), enabling the reader to know exactly how many bytes to consume
- Source: entries/2026/05/28/topic-lsm-crash-recovery-wal-format.md

### wal-recover-on-every-startup [IN] OBSERVATION
`BTree.__init__` calls `WAL.recover()` unconditionally on every startup; an empty WAL short-circuits after a single read with no further I/O
- Source: entries/2026/05/29/b-tree-storage-engine-btree-recover.md

### wal-recover-seq-vs-read-all-differ [IN] OBSERVATION
`_recover_seq_num` uses `break` on corruption (continues to next file) while `_read_all_records` uses `return` (stops everything) — recovery is lenient to maximize the sequence counter, replay is strict to avoid replaying past corruption
- Source: entries/2026/05/29/write-ahead-log-wal-_read_all_records.md

### wal-recover-then-truncate [IN] OBSERVATION
WAL recovery fsyncs the data file via `page_manager.sync()` before truncating the WAL, so a crash during recovery itself leaves the WAL intact for re-replay
- Source: entries/2026/05/29/b-tree-storage-engine-btree-recover.md

### wal-recovery-contract-is-valid-prefix [IN] OBSERVATION
The WAL's recovery model guarantees a valid-prefix contract: both corruption (CRC mismatch halts replay returning prior valid records) and partial writes (treated as EOF, not error) cause replay to stop cleanly, always returning a consistent prefix of the written log.

### wal-recovery-depends-on-listdir [IN] OBSERVATION
WAL segment discovery during recovery uses `os.listdir()`, so any segment whose parent directory entry was not fsynced becomes invisible after a crash — connecting the dir-fsync gap to concrete data loss.
- Source: entries/2026/05/29/topic-fsync-on-new-file-creation.md

### wal-recovery-infrastructure-is-vestigial [IN] OBSERVATION
The WAL carries vestigial recovery-related infrastructure: sequence numbers are monotonically assigned under lock and stored in every record but never consulted for ordering, deduplication, or gap detection during recovery (file position alone determines replay order), and checkpoint records occupy positions in the same sequence space but are neither searched for during recovery scanning nor used as truncation watermarks.

### wal-recovery-scans-full-file [IN] OBSERVATION
`_recover_seq_num` in `write-ahead-log/wal.py` reads every record in every WAL file sequentially from byte zero; there is no seek-to-end or block-skip optimization, making recovery O(file-size)
- Source: entries/2026/05/29/topic-block-aligned-wal-records.md

### wal-replay-ignores-commit [IN] OBSERVATION
`replay()` returns all PUT/DELETE records regardless of whether they are followed by a COMMIT record, meaning uncommitted batches are replayed identically to committed ones.
- Source: entries/2026/05/28/write-ahead-log-wal.md

### wal-replay-no-atomicity-check [IN] OBSERVATION
`replay()` does not verify that batch operations have a corresponding COMMIT record; partial batches from a mid-write crash would be replayed as if committed, since replay filters only by record type, not by batch completeness
- Source: entries/2026/05/28/topic-wal-commit-semantics.md

### wal-rotate-fsync-before-close [IN] OBSERVATION
`_rotate` always fsyncs the outgoing segment before closing it, ensuring no buffered writes are lost during segment rotation.
- Source: entries/2026/05/29/write-ahead-log-wal-_rotate.md

### wal-rotate-no-gap-protection [IN] OBSERVATION
If a segment file is manually deleted from the directory, `_rotate` can produce a filename that reuses a previously-used number, since it derives the next name solely from the current highest filename.
- Source: entries/2026/05/29/write-ahead-log-wal-_rotate.md

### wal-rotation-is-post-write [IN] OBSERVATION
Both `append` and `append_batch` call `_maybe_rotate()` after write and sync complete; rotation never interrupts an in-progress write, and the file only rotates on the next operation
- Source: entries/2026/05/29/topic-batch-atomicity-across-rotation.md

### wal-rotation-monotonicity-untested [IN] OBSERVATION
`test_rotation` asserts record count after rotation but does not verify that sequence numbers are monotonically increasing across file boundaries — the cross-file monotonicity invariant holds by construction but has no regression test.
- Source: entries/2026/05/29/topic-multi-file-replay-ordering.md

### wal-segment-naming-zero-padded [IN] OBSERVATION
WAL segment filenames are zero-padded 6-digit integers (`000001.wal`, `000002.wal`, ...) derived from the highest existing filename plus one, which makes lexicographic sort equal numeric sort.
- Source: entries/2026/05/29/write-ahead-log-wal-_rotate.md

### wal-seq-num-global-monotonic [IN] OBSERVATION
`_seq_num` is a single in-memory counter that only increments via `+= 1`; all records across all WAL files share one monotonically increasing sequence space that never resets, even across file rotation.
- Source: entries/2026/05/29/topic-multi-file-replay-ordering.md

### wal-seq-num-is-monotonic-not-page-lsn [IN] OBSERVATION
WAL `seq_num` increases monotonically across all records but is not stamped onto data pages, so there is no mechanism to detect whether a replayed operation was already applied — making replay non-idempotent against the underlying store.
- Source: entries/2026/05/29/topic-redo-vs-undo-logging.md

### wal-seq-num-recovered-on-init [IN] OBSERVATION
On construction, `_recover_seq_num()` scans all WAL files to find the maximum surviving sequence number, guaranteeing the monotonic counter never goes backward across crashes
- Source: entries/2026/05/29/topic-log-structured-checkpoint-coordination.md

### wal-seq-nums-are-vestigial [IN] OBSERVATION
WAL sequence numbers are monotonic and strictly increasing but entirely vestigial: they are computed under lock, stored in every record, and parsed during recovery, yet never consulted for ordering, deduplication, or gap detection — file position alone determines replay order.

### wal-seq-nums-strictly-monotonic [IN] OBSERVATION
Sequence numbers increment under a threading.Lock and are never reused; on recovery, all WAL files are scanned to find the high-water mark so new records continue the sequence.
- Source: entries/2026/05/28/write-ahead-log-wal.md

### wal-seq-reset-on-commit [IN] OBSERVATION
The WAL sequence counter resets to 0 on every commit; sequence numbers are only meaningful within a single uncommitted transaction window
- Source: entries/2026/05/28/b-tree-storage-engine-btree-commit.md

### wal-seq-starts-at-one [IN] OBSERVATION
Sequence numbers begin at 1 for the first appended record; an empty WAL reports `current_seq_num() == 0`
- Source: entries/2026/05/28/write-ahead-log-test_wal.md

### wal-seq-unused-in-recovery [IN] OBSERVATION
The 4-byte sequence number field in each WAL entry is parsed during `recover()` but never consulted; replay order is determined solely by file offset
- Source: entries/2026/05/29/b-tree-storage-engine-btree-recover.md

### wal-single-writer-fd [IN] OBSERVATION
At most one file descriptor is open for WAL writes at any time; `_rotate` closes the old fd before opening the new one.
- Source: entries/2026/05/29/write-ahead-log-wal-_rotate.md

### wal-single-writer-thread-level [IN] OBSERVATION
The WAL enforces single-writer via `threading.Lock` but has no inter-process locking mechanism (no flock/PID file), so the single-writer invariant holds only within a single OS process
- Source: entries/2026/05/29/topic-wal-size-check-toctou.md

### wal-size-check-uses-disk [IN] OBSERVATION
`_open_latest` checks on-disk file size via `os.path.getsize`, not an in-memory counter, making it correct across crash/restart boundaries.
- Source: entries/2026/05/29/write-ahead-log-wal-_open_latest.md

### wal-sync-mode-default-is-sync [IN] OBSERVATION
`WriteAheadLog` defaults to `sync_mode="sync"`, calling `flush()` + `os.fsync()` after every single `append` call — the safest and slowest mode.
- Source: entries/2026/05/29/topic-wal-sync-modes.md

### wal-sync-mode-survives-crash [IN] OBSERVATION
With `sync_mode="sync"`, WAL records are durable immediately after `append` returns — they survive process death even if `close()` is never called.
- Source: entries/2026/05/28/write-ahead-log-tester_test_wal.md

### wal-sync-mode-unvalidated [IN] OBSERVATION
`WriteAheadLog` accepts any string as `sync_mode` without validation; values other than `"sync"` and `"batch"` silently disable all fsync.
- Source: entries/2026/05/29/write-ahead-log-wal-_do_sync.md

### wal-thread-safety-stranded-by-callers [IN] OBSERVATION
The WAL carefully serializes all mutations (append, append_batch, checkpoint, truncate) under a threading.Lock, but both callers that depend on it have no synchronization for their own shared state: the LSM tree mutates _memtable, _sstables, and _immutable_memtables without any locking or atomic swaps, and _sstables specifically is mutated by both flush (append) and compact (full replacement) concurrently — the lock protects the log but not the data structures built from it.

### wal-truncate-blocks-all-operations [IN] OBSERVATION
`truncate` holds `self._lock` for its entire duration (close, rewrite, reopen), blocking all concurrent appends, replays, and iterates
- Source: entries/2026/05/28/write-ahead-log-wal-truncate.md

### wal-truncate-closes-write-fd [IN] OBSERVATION
`truncate()` flushes, fsyncs, and closes the current write file descriptor before iterating segments, preventing conflicts with files it may need to delete or rewrite
- Source: entries/2026/05/28/topic-wal-segment-deletion-ordering.md

### wal-truncate-deletes-empty-files [IN] OBSERVATION
WAL files where every record has `seq_num <= up_to_seq` are deleted entirely via `os.remove` rather than left as empty files on disk
- Source: entries/2026/05/28/write-ahead-log-wal-truncate.md

### wal-truncate-deletes-oldest-first [IN] OBSERVATION
`truncate()` iterates segments via `_wal_files()` in oldest-first order, so a crash mid-truncation leaves a contiguous suffix of segments — preserving the recovery invariant that surviving files form a continuous sequence
- Source: entries/2026/05/28/topic-wal-segment-deletion-ordering.md

### wal-truncate-drops-after-corruption [IN] OBSERVATION
During truncate, if a corrupt record is encountered in a WAL file, all subsequent records in that file are silently discarded regardless of their sequence number — even records with `seq_num > up_to_seq`
- Source: entries/2026/05/28/write-ahead-log-wal-truncate.md

### wal-truncate-fsyncs-before-scan [IN] OBSERVATION
`truncate()` flushes and fsyncs the current WAL file before scanning for records to remove, preventing data loss from buffered writes that haven't reached disk
- Source: entries/2026/05/29/topic-log-structured-checkpoint-coordination.md

### wal-truncate-inclusive-boundary [IN] OBSERVATION
`truncate(n)` removes records with `seq_num <= n` (inclusive upper bound), keeping only records strictly greater than `n`
- Source: entries/2026/05/28/write-ahead-log-wal-truncate.md

### wal-truncate-is-inclusive [IN] OBSERVATION
`truncate(seq)` removes all records with sequence number less than or equal to `seq`, keeping only records where `seq_num > seq`
- Source: entries/2026/05/28/write-ahead-log-test_wal.md

### wal-truncate-no-bounds-check [IN] OBSERVATION
`truncate` does not validate that `up_to_seq` is within range — passing a value beyond `current_seq_num()` silently deletes all records without error
- Source: entries/2026/05/28/write-ahead-log-wal-truncate.md

### wal-truncate-not-crash-safe [IN] OBSERVATION
`truncate()` rewrites WAL files in place without atomic rename, so a crash during truncation can leave the log in an inconsistent state.
- Source: entries/2026/05/28/write-ahead-log-wal.md

### wal-truncate-preserves-records-above-seq [IN] OBSERVATION
`WriteAheadLog.truncate(up_to_seq)` keeps records with `seq_num > up_to_seq` and deletes only those at or below, enabling partial log reclamation tied to checkpoint boundaries.
- Source: entries/2026/05/29/topic-wal-checkpoint-protocol.md

### wal-truncate-requires-explicit-seq [IN] OBSERVATION
`truncate(up_to_seq)` takes an explicit sequence number parameter; the WAL never decides what to truncate on its own — the caller must provide the boundary
- Source: entries/2026/05/29/topic-log-structured-checkpoint-coordination.md

### wal-truncate-requires-prior-checkpoint [IN] OBSERVATION
WAL truncation is only safe because it is called after the data the WAL protects has been durably written to the main store (SSTable flush or data file fsync); violating this ordering loses committed data with no recovery path
- Source: entries/2026/05/29/topic-wal-crash-safety-gap.md

### wal-truncate-rewrites-files [IN] OBSERVATION
`WriteAheadLog.truncate()` reads and rewrites every segment file record-by-record rather than deleting whole segment files, making it O(total records) instead of O(segments)
- Source: entries/2026/05/29/topic-log-compaction-vs-persistence.md

### wal-truncation-is-multi-failure-hazard [IN] OBSERVATION
WAL truncation combines three independent failure modes: it blocks all concurrent operations for its entire duration, silently discards all records after encountering corruption in a file, and is not crash-safe due to in-place file rewriting without atomic rename.

### wal-truncation-vs-corruption-distinction [IN] OBSERVATION
`_read_record` returns `None` for short reads (truncation) but raises `ValueError` for CRC mismatch (corruption), giving callers two distinct failure modes to handle differently during recovery
- Source: entries/2026/05/29/topic-partial-write-detection.md

### wal-two-tier-durability-model [IN] OBSERVATION
The WAL provides two distinct durability levels: individual appends respect the configured sync mode (potentially skipping fsync entirely), while batch operations always force fsync, meaning batch writes are strictly more durable than individual writes regardless of configuration.

### wal-uses-crc32-not-sha [IN] OBSERVATION
WAL integrity uses `zlib.crc32` (32-bit, non-cryptographic); it detects accidental corruption but not intentional tampering
- Source: entries/2026/05/28/topic-wal-checksum-format.md

### write-meta-no-fsync [IN] OBSERVATION
`PageManager._write_meta` calls `flush()` but not `os.fsync()`, so metadata updates (`next_free_page`, `free_list_head`) are not durable against power loss — compounding the WAL bypass with a durability gap on the direct write path.
- Source: entries/2026/05/29/topic-free-list-corruption-risks.md

### write-page-silently-truncates [IN] OBSERVATION
`PageManager.write_page` truncates data exceeding `page_size` to exactly `page_size` bytes without raising an error, which would corrupt the node header's `num_keys` count
- Source: entries/2026/05/29/topic-page-overflow-and-size-limits.md

### write-skew-has-no-default-tests [IN] OBSERVATION
The `write-skew-detection` module's tests exist only in `tester_test_ssi.py`, making it entirely untested under default `pytest` invocation since that filename does not match the `test_*.py` / `*_test.py` globs
- Source: entries/2026/05/29/topic-pytest-default-collection.md

### write-time-durability-engineering-abandoned-at-recovery [IN] OBSERVATION
The WAL's carefully engineered write-time durability infrastructure is systematically abandoned at recovery time: the two-tier durability model (per-write sync vs. batch-only fsync) loses all distinction during replay because recovery ignores tiers entirely, AND crash recovery is both broken (no safe path across any implementation) and unverified (no crash or async tests), meaning the engineering investment in write-time safety provides zero value when it is most needed.

### wrong-digest-cannot-reach-quorum [IN] OBSERVATION
PREPARE and COMMIT messages with non-matching digests are silently dropped and never count toward quorum thresholds; a Byzantine node sending bad digests cannot contribute to agreement
- Source: entries/2026/05/29/topic-byzantine-fault-tolerance.md

### wrong-digest-is-node-specific [IN] OBSERVATION
`WRONG_DIGEST` mode produces `"bad_digest_{node_id}"`, so two Byzantine nodes with this mode produce different invalid digests rather than accidentally colluding on the same forged value
- Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-_apply_byzantine.md

### zero-entries-stripped [IN] OBSERVATION
`VectorClock.__init__` strips all entries with value 0, ensuring two clocks with identical non-zero entries are always equal and hash-equal.
- Source: entries/2026/05/29/vector-clocks-vector_clock.md

### zip-truncation-risk [IN] OBSERVATION
If `keys` and `values` lists passed to `_serialize_leaf` have mismatched lengths, `zip` silently truncates to the shorter list while `num_keys` in the header reflects the longer, producing a corrupt page
- Source: entries/2026/05/28/b-tree-storage-engine-btree-_serialize_leaf.md

### zlib-crc32-is-iso3309 [IN] OBSERVATION
All 13 CRC call sites across the codebase use `zlib.crc32`, which implements the ISO 3309 polynomial (0xEDB88320); no module uses the Castagnoli polynomial (CRC-32C) that RocksDB and PostgreSQL prefer for hardware-accelerated checksumming.
- Source: entries/2026/05/29/topic-crc32-vs-crc32c.md

### zlib-crc32-supports-chaining-natively [IN] OBSERVATION
Python's `zlib.crc32(data, initial_value)` accepts an initial CRC value, meaning chained checksums could be implemented in this codebase by passing the previous frame's CRC as the seed with no additional hashing infrastructure
- Source: entries/2026/05/29/topic-cumulative-checksums.md

### 2pc-recovery-reaches-terminal-state [OUT] OBSERVATION
Two-phase commit recovery drives all interrupted transactions to a terminal state by replaying committed decisions to participants, with lock ownership guards preventing cross-transaction interference during the recovery process.

### avro-schema-evolution-handles-version-mismatch [OUT] OBSERVATION
Avro schema evolution correctly handles writer-reader version mismatches through mandatory dual-schema resolution (writer schema parses wire bytes, reader schema shapes output), with clean separation between structural schema errors at parse time and compatibility errors at resolution time.

### bitcask-compaction-preserves-state [OUT] OBSERVATION
Bitcask compaction maintains identical observable behavior: every key returns the same value before and after compaction.

### bloom-filter-sizing-is-production-ready [OUT] OBSERVATION
The Bloom filter implementation uses textbook-optimal bit array and hash count formulas, making it ready for production integration.

### btree-page-alignment-enables-corruption-recovery [OUT] OBSERVATION
Fixed-size page addressing provides natural resync boundaries for recovering from data file corruption.

### btree-range-scan-provides-ordered-access [OUT] OBSERVATION
B-tree range scans provide correct ordered sequential access by walking the actively maintained sibling chain with support for unbounded end keys, traversing all matching keys in sorted order.

### btree-splits-are-crash-safe [OUT] OBSERVATION
Multi-page B-tree operations (splits, deletes) are made crash-safe by writing all modifications to the WAL before applying them to data pages.

### consistent-hash-routing-correct-under-single-thread [OUT] OBSERVATION
Consistent hash routing provides correct minimal-redistribution key assignment with proper deduplication: adding an Nth node moves approximately 1/N of keys (not a full reshuffle), and the preference list correctly skips virtual nodes of already-seen physical nodes to return exactly replication_factor distinct physical nodes.

### crdt-convergence-practically-sustainable [OUT] OBSERVATION
CRDT merge convergence is both algebraically correct and practically sustainable in production with bounded resource consumption.

### crdt-merge-algebra-satisfies-convergence-requirements [OUT] OBSERVATION
All four CRDT types satisfy the algebraic properties required for strong eventual convergence: merge is idempotent (re-merging produces no change), equality compares semantic state rather than object identity (enabling correct convergence checks), and ORSet tombstones grow monotonically (preventing element resurrection after removal).

### derived-systems-maintain-consistency-when-position-durable [OUT] OBSERVATION
Derived systems (secondary indexes, materialized views, projections) can maintain consistency with their source through position-tracked CDC replay: every derived system tracks how far through the CDC log it has processed, and any derived system can be rebuilt from scratch via full event replay producing identical state to incremental processing.

### distributed-layer-has-three-incompatible-convergence-models [OUT] OBSERVATION
The distributed layer uses three fundamentally incompatible convergence and resolution models with no unifying bridge: CRDTs encode resolution algebraically in merge semantics, the strategy pattern selects between LWW and custom resolution at runtime, AND consensus and membership use irreconcilable strong-leader (Raft) vs. eventual-consistency (gossip) models — composing correct end-to-end behavior requires manually bridging paradigms designed in isolation.

### distributed-layer-incoherent-and-unachievable [OUT] OBSERVATION
The distributed layer is both internally incoherent and externally unachievable: convergence and ordering models are mutually incompatible across modules (CRDTs encode algebraic resolution, LWW uses wall-clock tiebreaking, gossip uses epidemic thresholds) AND correctness is doubly unachievable under network partitions (storage-layer guarantees are unmet, write correctness gaps are amplified by partition-induced failure detection breakdowns).

### distributed-models-incompatible-at-convergence-and-ordering [OUT] OBSERVATION
The distributed layer has mutually incompatible models at two independent levels: convergence mechanisms (CRDTs encode algebraic resolution, LWW uses timestamp comparison, Raft requires strong leader authority, gossip relies on epidemic propagation) have no unifying bridge, AND ordering models (Lamport clocks provide total order via node-ID tiebreaking, vector clocks provide partial order with incomparable states, hash indexes use non-monotonic wall-clock time) are fundamentally incompatible across modules.

### dynamo-read-repair-ensures-replica-consistency [OUT] OBSERVATION
Eager all-node read repair after every quorum read ensures all replicas converge to the latest version, because repair propagates to all reachable nodes (not just quorum participants) and per-key versioning prevents false cross-key conflicts.

### event-sourcing-state-is-fully-reconstructible [OUT] OBSERVATION
Event sourcing state can be fully reconstructed from the event log: projections are stateless replay functions (cache, not source of truth), and snapshots use deep copy to isolate stored state from mutations.

### fencing-provides-linearizable-writes [OUT] OBSERVATION
Fencing tokens provide linearizable write protection at the resource server: stale tokens are permanently rejected (no expiration) and the monotonic token ordering ensures only the most recent lock holder can successfully write.

### hash-index-survives-restart [OUT] OBSERVATION
Hash-index storage correctly reconstructs its in-memory index from on-disk segments after restart by scanning in ascending order so newer writes overwrite older entries, with fsync-controlled durability for each write.

### hinted-handoff-ensures-write-availability [OUT] OBSERVATION
Hinted handoff maintains write availability during replica failures by routing writes to substitute nodes that forward data when the original replica recovers.

### lamport-mutex-provides-mutual-exclusion [OUT] OBSERVATION
The Lamport mutex correctly ensures mutual exclusion: the requesting node with the lowest timestamp gets priority, and entry requires acknowledgment from every other node in the system, preventing any two nodes from entering the critical section simultaneously.

### lsm-compaction-output-is-correct [OUT] OBSERVATION
LSM compaction produces correct merged output: tombstones are properly purged from the output SSTable, and the k-way merge is forward-only preserving sort order.

### lsm-compaction-safely-reduces-read-amplification [OUT] OBSERVATION
LSM compaction correctly reduces read amplification by merging all SSTables into a single output with last-writer-wins dedup: the full-merge strategy eliminates all redundant entries and the newest value always wins during conflict resolution, producing a minimal SSTable set.

### lsm-read-path-correct-across-flushes [OUT] OBSERVATION
The LSM read path maintains correctness across memtable flushes by searching newest-first (memtable then SSTables in reverse sequence order) using a reference swap rather than deep copy for the frozen memtable.

### lsm-wal-provides-crash-recovery [OUT] OBSERVATION
The LSM WAL replays on construction to recover unflushed memtable state, providing crash recovery for in-flight writes.

### multi-leader-convergence-reliable-across-topologies [OUT] OBSERVATION
Multi-leader replication achieves reliable eventual convergence regardless of network topology: sync uses a safe collect-then-distribute pattern with idempotent merge and monotonically advancing timestamps, and topology choice affects only the duration of observable divergence (linear in node count for ring, single round for all-to-all), not the final converged state.

### mvcc-isolation-survives-restart [OUT] OBSERVATION
MVCC's three-layer visibility model (append-only versions, own-writes visibility, symmetric deletion) with active sets frozen at transaction begin would provide correct snapshot isolation across process restarts if all state were persisted, since the visibility rules themselves are sound and compose correctly.

### pbft-view-change-safety-holds-with-stable-digests [OUT] OBSERVATION
PBFT view changes maintain safety (2f+1 agreement) and liveness (carrying prepared requests forward with contiguous sequence numbers into the new view), ensuring no committed request is lost across leader transitions.

### ssi-isolation-holds-under-single-thread [OUT] OBSERVATION
SSI's composed invariant layers provide correct serializable snapshot isolation: MVCC enforces visibility through append-only versions and symmetric deletion rules, and SSI validation ensures own-writes are checked before store consultation, preventing lost updates.

### sstable-point-lookup-correct-under-sort-invariant [OUT] OBSERVATION
SSTable point lookups return correct results through two-phase search (binary search on sparse index to identify block, then linear scan within block), with clean None returns for keys not present in the table.

### tob-ordering-verified-under-real-failures [OUT] OBSERVATION
Total Order Broadcast maintains identical delivery ordering across node failures: recovered nodes deliver the same slot sequence as live nodes, confirming that per-slot Paxos consensus and contiguous slot delivery enforce a single global order even through failure and recovery.

### two-incompatible-conflict-resolution-paradigms [OUT] OBSERVATION
The codebase contains two mathematically incompatible conflict resolution paradigms with no bridge between them: CRDTs provide algebraically proven convergence (idempotent, commutative merge satisfying SEC requirements), while the multi-leader strategy pattern offers LWW and custom-merge without formal convergence guarantees, and no mechanism exists to compose or translate between them.

### unbundled-catchup-produces-consistent-derived-state [OUT] OBSERVATION
Catch-up via snapshot and streaming produces consistent derived-system state equivalent to full event replay, because WAL entries are ordered by LSN and the rebuild protocol is verified to match full replay output.

### wal-batch-replay-provides-atomicity [OUT] OBSERVATION
WAL batch replay correctly identifies and rejects incomplete batches via the trailing COMMIT sentinel, providing atomic-or-nothing batch recovery semantics.

### wal-durability-effective-on-standards-compliant-fsync [OUT] OBSERVATION
The WAL's two-tier durability model provides effective crash protection: critical operations (checkpoints, batch commits, rotations) unconditionally force fsync while per-write sync respects the configured mode, creating a meaningful durability hierarchy where batch boundaries are always durable regardless of the performance/durability tradeoff chosen for individual writes.

### wal-is-reliable-source-of-truth [OUT] OBSERVATION
The WAL serves as the authoritative source from which the entire storage engine can be rebuilt via replay.

### wal-multi-segment-continuity-is-reliable [OUT] OBSERVATION
Multi-segment WAL replay provides reliable cross-segment continuity: segment rotation fsyncs the outgoing file before closing it, and EOF within one segment advances to the next rather than terminating replay, ensuring no inter-segment data loss during normal operation.

### wal-sequence-numbers-enable-ordered-recovery [OUT] OBSERVATION
WAL sequence numbers provide infrastructure for ordered recovery: they are recovered from disk on init, checkpoints consume positions in the same monotonic space, and the total ordering could enable gap detection and ordered replay.

### wal-truncation-safe-when-linear [OUT] OBSERVATION
WAL truncation maintains data safety through two complementary mechanisms: segments are processed in oldest-first order (so a crash mid-truncation leaves a contiguous suffix of segments), and the current WAL file is flushed and fsynced before scanning for records to remove (preventing buffered data from being lost).