Belief Registry

Claims

2pc-abort-guarantees-no-side-effects [IN] OBSERVATION

When any participant is unavailable, execute() returns "aborted" and no participant's state is modified, even for participants that were available and could have committed independently

Source: entries/2026/05/29/two-phase-commit-test_2pc.md

2pc-blocking-window [IN] OBSERVATION

The coordinator logs "committing" before any participant receives the commit decision, creating a window where a crash leaves participants locked indefinitely in the "prepared" state with no authority to self-resolve.

Source: entries/2026/05/29/topic-2pc-blocking-problem.md

2pc-blocking-window-is-between-prepare-and-decision [IN] OBSERVATION

A participant that has voted YES in prepare() holds locks and cannot unilaterally commit or abort until it receives the coordinator's decision, creating an unbounded blocking window if the coordinator fails

Source: entries/2026/05/29/topic-three-phase-commit.md

2pc-coordinator-recover-requires-liveness [IN] OBSERVATION

Coordinator.recover() can only re-send decisions to participants that are currently available (is_available() check), leaving unavailable participants still in-doubt with no resolution path

Source: entries/2026/05/29/topic-three-phase-commit.md

2pc-coordinator-recovery-replays-commit [IN] OBSERVATION

A coordinator with a "committing" log entry will re-send commit decisions to all participants during recover(), ensuring that a crash after the commit decision but before delivery still completes the transaction

Source: entries/2026/05/29/two-phase-commit-test_2pc.md

2pc-has-design-implementation-gap [IN] OBSERVATION

Two-phase commit has a systematic gap between its safety design and implementation: the coordinator has a known blocking window between logging and sending the decision, timeouts are accepted as a parameter but never enforced, and recovery requires all participants to be available — defeating the protocol's availability goals.

2pc-lock-ownership-guard [IN] OBSERVATION

Participant.abort() only releases locks where self.locks.get(op["key"]) == tx_id, preventing one transaction's abort from releasing another transaction's lock

Source: entries/2026/05/29/two-phase-commit-twophasecommit.md

2pc-locks-held-during-in-doubt [IN] OBSERVATION

Locks acquired during prepare() are only released by commit() or abort(), meaning an in-doubt transaction blocks all subsequent transactions on the same keys indefinitely until the coordinator recovers

Source: entries/2026/05/29/topic-three-phase-commit.md

2pc-locks-released-on-both-paths [IN] OBSERVATION

Locks held by a transaction are released regardless of whether the transaction commits or aborts, preventing deadlocks across sequential transactions

Source: entries/2026/05/29/two-phase-commit-test_2pc.md

2pc-log-before-decision [IN] OBSERVATION

The coordinator appends "committing" or "aborting" to its log before broadcasting the decision to participants, enabling crash recovery via recover() to re-send decisions that were never delivered

Source: entries/2026/05/29/two-phase-commit-twophasecommit.md

2pc-participant-recover-returns-in-doubt-only [IN] OBSERVATION

Participant.recover() identifies transactions stuck in "prepared" state but provides no mechanism to resolve them without the coordinator, making it a diagnostic tool rather than a recovery procedure

Source: entries/2026/05/29/topic-three-phase-commit.md

2pc-recovery-single-pass [IN] OBSERVATION

Coordinator.recover() scans the log once and re-sends decisions to currently-available participants; it does not retry or schedule follow-ups, so a participant still down at recovery time stays locked until recover() is manually called again.

Source: entries/2026/05/29/topic-2pc-blocking-problem.md

2pc-result-dict-not-exceptions [IN] OBSERVATION

The 2PC protocol communicates all outcomes (commits, aborts, lock conflicts, unavailability) via {"outcome": "committed"|"aborted", "reason": ...} return dicts; no method raises exceptions

Source: entries/2026/05/29/two-phase-commit-test_2pc.md

2pc-safety-gaps-compound-under-synchronous-simulation [IN] OBSERVATION

Two-phase commit's design-implementation gaps (known blocking window with no timeout enforcement, recovery requiring participant availability) are validated only under synchronous simulation where messages arrive instantly, meaning the real-world impact of coordinator crashes during the blocking window — where participants hold locks indefinitely — remains untested.

2pc-timeout-unused [IN] OBSERVATION

The timeout parameter is accepted by Coordinator._init but never referenced in any logic; availability is modeled via boolean available flags on participants instead of actual time-based timeouts

Source: entries/2026/05/29/two-phase-commit-twophasecommit.md

2pc-unanimous-vote [IN] OBSERVATION

A transaction commits only if every participant in participant_operations votes "yes"; a single "no" vote triggers abort for all participants regardless of how many voted yes

Source: entries/2026/05/29/two-phase-commit-twophasecommit.md

2pc-uses-key-level-locking [IN] OBSERVATION

Participants lock at key granularity during prepare(); a second transaction touching a locked key receives a "no" vote and aborts rather than waiting or queuing

Source: entries/2026/05/29/two-phase-commit-test_2pc.md

abort-is-status-change-not-disk-rollback [IN] OBSERVATION

Aborting a transaction in both MVCC and SSI implementations sets a status flag (_aborted set or status marker); no disk writes are reversed because uncommitted data never reached disk

Source: entries/2026/05/29/topic-undo-logging-and-steal-policy.md

aborted-writes-universally-invisible [IN] OBSERVATION

A version created by an aborted transaction is invisible to every transaction, and a deletion by an aborted transaction is ignored — aborted transactions' effects are completely erased from all snapshots.

Source: entries/2026/05/29/snapshot-isolation-mvccdatabase-is_visible.md

add-node-incremental-ring-mutation [IN] OBSERVATION

add_node mutates the ring after each vnode insertion, so subsequent vnodes in the same loop see predecessors from earlier iterations, preventing overlapping transfer arcs.

Source: entries/2026/05/29/topic-transfer-map-accuracy.md

add-node-mutation-cost-is-quadratic-in-ring-size [IN] OBSERVATION

Each addnode call performs vnodecount list insertions into a sorted list, each O(total ring entries), making node addition O(V × N×V) in the worst case.

Source: entries/2026/05/29/topic-virtual-node-count-tuning.md

add-node-same-owner-guard [IN] OBSERVATION

The oldowner != nodeid guard at consistent_hashing.py:38 skips transfer recording when a new vnode's successor is another vnode of the same node, avoiding double-counting already-claimed arcs.

Source: entries/2026/05/29/topic-transfer-map-accuracy.md

advance-time-injects-external-watermark [IN] OBSERVATION

advance_time(timestamp) lets an external orchestrator push the watermark forward without receiving an actual event, enabling idle-stream handling and explicit trigger semantics; it rejects timestamps at or below the current watermark

Source: entries/2026/05/29/topic-watermark-vs-processing-time.md

all-fsync-sites-data-integrity-only [IN] OBSERVATION

None of the 13 os.fsync() call sites have callers that depend on mtime or ctime metadata being accurate; all syncs exist purely for data durability, making every site a valid fdatasync candidate

Source: entries/2026/05/29/topic-fdatasync-vs-fsync-tradeoffs.md

all-three-implementations-use-payload-only-crc [IN] OBSERVATION

The WAL, Bitcask, and B-tree storage engine all compute CRC over data payloads only, excluding their respective header/framing metadata fields — a consistent pattern across the repo

Source: entries/2026/05/28/topic-wal-crc-coverage-gap.md

all-to-all-converges-in-one-round [IN] OBSERVATION

ALLTOALL topology delivers every pending change to every other node in a single sync() call, achieving convergence in one round when no custom-merge cascades occur.

Source: entries/2026/05/29/topic-ring-vs-all-to-all-propagation-delay.md

allocate-page-no-initialization [IN] OBSERVATION

allocate_page returns a page number containing whatever stale data was previously on disk; the caller must overwrite it with valid page content before the next WAL commit

Source: entries/2026/05/28/b-tree-storage-engine-btree-allocate_page.md

allocate-page-prefers-free-list [IN] OBSERVATION

allocate_page always recycles from the free list before extending the file with a new page, keeping the data file compact after deletions

Source: entries/2026/05/28/b-tree-storage-engine-btree-allocate_page.md

anti-entropy-counts-per-node-writes [IN] OBSERVATION

The repair count returned by antientropyrepair() reflects individual node writes, not unique keys repaired; one stale key across N nodes counts as N repairs

Source: entries/2026/05/29/leaderless-replication-dynamo-antientropyrepair.md

anti-entropy-detects-but-cannot-fully-resolve-divergence [IN] OBSERVATION

Anti-entropy can precisely locate divergent key ranges via Merkle tree diffs but cannot fully reconcile them: tombstone semantics differ at every layer (empty-bytes sentinel in LSM, preserved-by-default in merge, replication-convergence-dependent in distributed), preventing consistent cross-replica resolution of deleted keys.

anti-entropy-does-not-advance-versions [IN] OBSERVATION

antientropyrepair() propagates existing VersionedValue entries without modifying versioncounters, so it never creates new versions — only copies the highest existing version to lagging replicas

Source: entries/2026/05/29/leaderless-replication-dynamo-antientropyrepair.md

anti-entropy-highest-version-wins [IN] OBSERVATION

Anti-entropy repair assumes a total ordering on versions where the highest version number is authoritative, which is safe because put() uses a global per-key version counter ensuring no two writes produce the same version

Source: entries/2026/05/29/leaderless-replication-dynamo-antientropyrepair.md

anti-entropy-layers-are-decoupled [IN] OBSERVATION

The gossip, merkle-tree, vector-clock, read-repair, and hinted-handoff modules are independent implementations with no cross-imports; composing them into a full Dynamo-style anti-entropy pipeline is left to the integrator

Source: entries/2026/05/29/topic-dynamo-anti-entropy.md

anti-entropy-repair-skips-unavailable [IN] OBSERVATION

Both ReadRepairStore.antientropyrepair() and DynamoCluster.antientropyrepair() only scan and repair currently available replicas; a down replica must rely on a future anti-entropy run after it recovers.

Source: entries/2026/05/29/topic-anti-entropy-vs-read-repair.md

anti-entropy-skips-unavailable-nodes [IN] OBSERVATION

Anti-entropy repair only reads from and writes to nodes where is_available is true; down nodes are excluded from both key discovery and repair propagation

Source: entries/2026/05/29/leaderless-replication-dynamo-antientropyrepair.md

anti-entropy-unions-all-keys [IN] OBSERVATION

antientropyrepair() collects keys from all available nodes via set union, so a key present on any single node is propagated to all other nodes regardless of whether it was intentionally removed elsewhere

Source: entries/2026/05/29/topic-leaderless-deletion-gap.md

apfs-masks-linux-fsync-bugs [IN] OBSERVATION

macOS APFS provides implicit rename durability through its CoW transaction model, so the missing directory fsync is a latent bug that only surfaces on Linux filesystems (ext4, XFS)

Source: entries/2026/05/28/topic-directory-fsync-semantics.md

append-batch-no-rollback [IN] OBSERVATION

If an exception occurs mid-batch in appendbatch (e.g., disk I/O failure in persistevent), already-appended events remain in events, _streams, and on disk with no rollback — the process continues with inconsistent in-memory state

Source: entries/2026/05/29/topic-event-sourcing-concurrency-model.md

append-batch-shares-references [IN] OBSERVATION

Returned Event objects from appendbatch are the same instances stored in events; no defensive copy is made for either the event or its data dict, so caller mutation corrupts store state.

Source: entries/2026/05/29/event-sourcing-store-eventstore-appendbatch.md

append-batch-version-check-is-precondition [IN] OBSERVATION

appendbatch checks expectedversion once before the write loop; it does not re-check after each event, so the guard is a pre-condition gate that rejects the entire batch cleanly on mismatch, not a per-event invariant

Source: entries/2026/05/29/topic-event-sourcing-concurrency-model.md

apply-byzantine-is-outbound-only [IN] OBSERVATION

applybyzantine is applied exclusively to outgoing messages; incoming messages are never filtered through it, so a Byzantine node's internal state (prepared/committed sets) remains consistent even while it sends corrupted messages

Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-applybyzantine.md

apply-remote-change-advances-lamport-clock [IN] OBSERVATION

applyremotechange always advances the node's Lamport clock to max(localclock, remotets) + 1 before any conflict resolution, ensuring subsequent local writes have causally-later timestamps than any received change.

Source: entries/2026/05/29/multi-leader-replication-multileader-applyremote_change.md

apply-remote-change-no-merge-fn-guard [IN] OBSERVATION

applyremotechange does not validate that mergefn is non-None when strategy is CUSTOMMERGE; passing None raises TypeError at call time rather than a descriptive error.

Source: entries/2026/05/29/multi-leader-replication-multileader-applyremote_change.md

async-index-defers-mutations [IN] OBSERVATION

With asyncindex=True, TermPartitionedDB queues index operations in pending instead of applying them immediately; flush_index() must be called to drain the queue, modeling asynchronous global index updates.

Source: entries/2026/05/29/secondary-index-partitioning-secondaryindexpartitioning.md

auto-commit-not-default [IN] OBSERVATION

Consumer defaults to auto_commit=False (line 182), requiring explicit commit() calls; at-least-once semantics via auto-commit are opt-in, not the default behavior

Source: entries/2026/05/29/topic-kafka-consumer-offset-semantics.md

avro-block-encoding-zero-terminated [IN] OBSERVATION

Arrays and maps are encoded as a sequence of count-prefixed blocks terminated by a 0-count block; counts and elements use zigzag+varint via writelong/readlong

Source: entries/2026/05/29/topic-avro-block-encoding.md

avro-encoder-positive-counts-only [IN] OBSERVATION

The AvroEncoder always writes positive counts for array/map blocks, never negative counts with byte-size prefixes, trading skip efficiency for encoder simplicity

Source: entries/2026/05/29/topic-avro-block-encoding.md

avro-encoding-prioritizes-compactness-over-self-description [IN] OBSERVATION

Avro's binary encoding systematically prioritizes compactness over self-description: the format contains no type tags or field names (requiring both schemas for decoding), arrays use zero-terminated blocks for streaming, and negative block counts enable O(1) skipping — all optimizations that assume schema availability.

avro-enum-default-fallback [IN] OBSERVATION

When a writer sends an enum symbol not in the reader's symbol list, the reader falls back to the default symbol declared in the reader schema rather than raising an error

Source: entries/2026/05/29/avro-serializer-test_avro.md

avro-int-bounds-enforced [IN] OBSERVATION

The Avro int type enforces 32-bit signed integer range ([-2^31, 2^31-1]) at encode time via ValueError, while long accepts values at least as large as 2^40

Source: entries/2026/05/29/avro-serializer-testavroserializer.md

avro-missing-reader-field-requires-default [IN] OBSERVATION

During record resolution, a reader field absent from the writer schema must have a default value; missing defaults cause SchemaCompatibilityError, making defaults the mechanism for forward/backward compatibility.

Source: entries/2026/05/29/avro-serializer-avro_serializer.md

avro-negative-count-carries-byte-size [IN] OBSERVATION

A negative block count -N in Avro's spec means N elements follow, preceded by a byte-size long that enables O(1) skipping of the entire block without parsing elements

Source: entries/2026/05/29/topic-avro-block-encoding.md

avro-no-default-sentinel [IN] OBSERVATION

NODEFAULT is a sentinel object distinguishing "no default provided" from a None/null default; this matters because null is a valid Avro default value, and conflating the two would break canonical form computation and default-filling during resolution.

Source: entries/2026/05/29/avro-serializer-avro_serializer.md

avro-no-self-describing-format [IN] OBSERVATION

Avro binary encoding contains no type tags, field names, or schema metadata; decoding is impossible without the writer schema.

Source: entries/2026/05/29/avro-serializer-avro_serializer.md

avro-promotions-one-directional [IN] OBSERVATION

Type promotion during schema resolution is strictly one-directional (int→long→float→double, int→float, int→double, long→double); reverse promotion (e.g., long→int) is not supported and raises SchemaCompatibilityError.

Source: entries/2026/05/29/avro-serializer-avro_serializer.md

avro-record-fields-wire-order [IN] OBSERVATION

Record fields are encoded and decoded in declaration order of the writer schema with no field names on the wire; reordering fields in the schema definition changes the binary format and breaks existing data.

Source: entries/2026/05/29/avro-serializer-avro_serializer.md

avro-registry-4byte-schema-id [IN] OBSERVATION

SchemaRegistry prepends a 4-byte big-endian unsigned int schema ID (struct.pack('>I', id)) to encoded payloads, modeling Confluent's Schema Registry wire format; schema IDs are limited to [0, 2^32).

Source: entries/2026/05/29/avro-serializer-avro_serializer.md

avro-schema-canonical-equality [IN] OBSERVATION

Schema("int") and Schema({"type": "int"}) are equal and hash-equal, enabling consistent set/dict usage regardless of whether the schema was constructed from a string or dict form

Source: entries/2026/05/29/avro-serializer-testavroserializer.md

avro-schema-error-vs-compatibility-error [IN] OBSERVATION

SchemaError is raised at parse time for structurally invalid definitions (a valid Schema object is guaranteed well-formed); SchemaCompatibilityError is raised during decode resolution or compatibility checking when two schemas cannot be reconciled.

Source: entries/2026/05/29/avro-serializer-avro_serializer.md

avro-skip-is-recursive-not-bytejump [IN] OBSERVATION

The decoder's _skip method discards unwanted fields by recursively walking the writer schema rather than using byte-size jumps, which works correctly but requires understanding the element schema

Source: entries/2026/05/29/topic-avro-block-encoding.md

avro-union-disambiguation [IN] OBSERVATION

When encoding a Python dict into a union containing both a record and a map type, the encoder resolves ambiguity by matching dict keys against record field names first, falling back to map only if no record matches

Source: entries/2026/05/29/avro-serializer-testavroserializer.md

avro-union-no-duplicate-types [IN] OBSERVATION

A union schema cannot contain two branches with the same type name (or record name); this constraint is enforced at schema parse time in Schema._parse.

Source: entries/2026/05/29/avro-serializer-avro_serializer.md

avro-writer-reader-dual-schema-resolution [IN] OBSERVATION

AvroDecoder always requires both a writer schema (to parse the wire bytes) and a reader schema (to shape the output); this dual-schema resolution is Avro's core mechanism for schema evolution.

Source: entries/2026/05/29/avro-serializer-avro_serializer.md

batch-append-not-crash-safe [IN] OBSERVATION

append_batch can leave a partial batch on disk if the process crashes mid-write, since events are individually appended to the NDJSON file without a transaction marker or write-ahead log

Source: entries/2026/05/29/event-sourcing-store-event_store.md

batch-mode-defers-fsync [IN] OBSERVATION

In batch sync mode, individual WAL appends do not fsync until writecount reaches batchsync_count (default 100), leaving up to 99 records vulnerable to crash loss

Source: entries/2026/05/29/topic-fsync-semantics-by-mode.md

batch-notifies-after-all-stored [IN] OBSERVATION

append_batch() stores all events in the batch first, then notifies subscribers of each event sequentially, ensuring subscribers see a consistent state where all batch events exist

Source: entries/2026/05/29/topic-live-projection-subscription-mechanism.md

batch-write-count-not-reset-on-forced-sync [IN] OBSERVATION

The WAL writecount counter only resets when the batch threshold triggers a sync (line 133), not when a forced sync occurs, which may cause counter drift between forced and threshold-triggered syncs

Source: entries/2026/05/29/topic-fsync-semantics-by-mode.md

binary-formats-rigid-across-entire-storage-stack [IN] OBSERVATION

The entire storage stack uses rigid binary formats that preclude both forward evolution and post-corruption recovery: WAL records are contiguously packed with no block alignment or version negotiation preventing resync after mid-file corruption, and SSTables lack per-entry checksums and efficient skip structures — neither layer can be upgraded in place or self-repaired after partial damage.

bitcask-auto-compact-threshold [IN] OBSERVATION

When the number of frozen segments exceeds autocompactthreshold (default 5), compaction is triggered automatically during put() operations

Source: entries/2026/05/28/log-structured-hash-table-test_bitcask.md

bitcask-compact-bypasses-write-record [IN] OBSERVATION

compact() re-serializes records directly to the output file rather than calling writerecord, duplicating the binary serialization logic in a separate code path

Source: entries/2026/05/29/log-structured-hash-table-bitcask-writerecord.md

bitcask-compact-not-crash-safe [IN] OBSERVATION

Hash-index-storage compaction is non-atomic with no error handling: a crash between deleting old files and completing the rewrite can leave the store in an unrecoverable state with no rollback mechanism.

Source: entries/2026/05/28/hash-index-storage-bitcask-compact.md

bitcask-compact-preserves-timestamps [IN] OBSERVATION

Compaction rewrites records with their original timestamps (not the time of compaction), preserving timestamp-based ordering across merges.

Source: entries/2026/05/28/hash-index-storage-bitcask-compact.md

bitcask-compact-skips-active-file-keys [IN] OBSERVATION

During compaction, keys whose keydir entry points to the active file are excluded from merge output; their immutable-file copies are treated as stale because the active file holds a newer value.

Source: entries/2026/05/28/hash-index-storage-bitcask-compact.md

bitcask-compact-writes-hint-files [IN] OBSERVATION

Compaction produces .hint files alongside each merged data file (including mid-compaction when a merged file hits maxfilesize), enabling O(keys) index rebuilds instead of O(records) full scans on next startup.

Source: entries/2026/05/28/hash-index-storage-bitcask-compact.md

bitcask-compaction-not-crash-safe [IN] OBSERVATION

Compaction performs delete-then-rename without journaling; a crash mid-compaction can leave the store in a state that _recover() cannot reconstruct correctly

Source: entries/2026/05/28/log-structured-hash-table-bitcask.md

bitcask-compaction-preserves-observable-state [IN] OBSERVATION

After compact(), every key returns the same value as before compaction and deleted keys remain absent; compaction changes only physical layout, never logical state.

Source: entries/2026/05/28/hash-index-storage-test_bitcask.md

bitcask-compaction-renumbers-active [IN] OBSERVATION

After compaction, merged files get IDs above the old active file, and the active file is renamed to an even higher ID to maintain the monotonically increasing file ID invariant.

Source: entries/2026/05/28/hash-index-storage-bitcask.md

bitcask-crash-recovery-without-hints [IN] OBSERVATION

BitcaskStore can rebuild its in-memory index by scanning .data files alone when .hint files are missing, producing identical read results to a clean startup with hint files present.

Source: entries/2026/05/28/hash-index-storage-test_bitcask.md

bitcask-crc-covers-payload-only [IN] OBSERVATION

CRC32 is computed over keybytes + valuebytes only; the 12-byte header (which contains the CRC itself) is excluded from the checksum input

Source: entries/2026/05/28/log-structured-hash-table-bitcask.md

bitcask-crc-validated-on-read-only [IN] OBSERVATION

In the log-structured hash table, CRC32 validation occurs only during get() and scansegment(); writes compute and store the CRC but never verify it back, so a corrupted write is not caught until read time

Source: entries/2026/05/29/topic-bitcask-binary-format.md

bitcask-crc32-raises-corruption-error [IN] OBSERVATION

Reading a record whose payload does not match its CRC32 header raises CorruptionError — the store never silently returns corrupt data

Source: entries/2026/05/29/log-structured-hash-table-testertestbitcask.md

bitcask-delete-nonexistent-is-noop [IN] OBSERVATION

Calling delete() on a key that doesn't exist in the store completes without error.

Source: entries/2026/05/28/hash-index-storage-test_bitcask.md

bitcask-file-rotation-transparent [IN] OBSERVATION

File rotation triggered by maxfilesize is invisible to readers — all keys remain retrievable across multiple .data files because the keydir maps each key to its specific (file_id, offset).

Source: entries/2026/05/29/hash-index-storage-testertestbitcask.md

bitcask-fsync-per-record-default [IN] OBSERVATION

syncwrites defaults to True, meaning every write_record call triggers an fsync — durable by default at significant write throughput cost.

Source: entries/2026/05/28/hash-index-storage-bitcask.md

bitcask-get-missing-returns-none [IN] OBSERVATION

BitcaskStore.get() returns None for missing or deleted keys, never raises KeyError.

Source: entries/2026/05/28/hash-index-storage-test_bitcask.md

bitcask-get-no-file-existence-guard [IN] OBSERVATION

If a segment file is deleted out-of-band or by a crash during compaction, get() raises an unhandled FileNotFoundError rather than returning None or a graceful error

Source: entries/2026/05/29/log-structured-hash-table-bitcask-get.md

bitcask-get-opens-fresh-handle [IN] OBSERVATION

get() opens a new file handle per call rather than using the filehandles cache, making the handle cache effectively unused for reads

Source: entries/2026/05/28/log-structured-hash-table-bitcask.md

bitcask-get-single-disk-seek [IN] OBSERVATION

get() performs exactly one disk seek per call; the key-to-offset mapping is resolved entirely in memory via self._index, making reads O(1) index lookup plus one disk read

Source: entries/2026/05/29/log-structured-hash-table-bitcask-get.md

bitcask-header-is-12-bytes [IN] OBSERVATION

The log-structured hash table's HEADERFMT = "!III" produces a fixed 12-byte header (three big-endian uint32s: CRC, keysize, value_size) at log-structured-hash-table/bitcask.py:10-11; total record size is always 12 + len(key) + len(value)

Source: entries/2026/05/29/topic-bitcask-binary-format.md

bitcask-hint-files-exclude-tombstones [IN] OBSERVATION

createhintfiles() skips tombstone records and only emits entries whose index currently points to that segment+offset, so hint-based recovery cannot distinguish "key was deleted" from "key never existed in this segment"

Source: entries/2026/05/28/log-structured-hash-table-bitcask.md

bitcask-hint-files-skip-scan [IN] OBSERVATION

When a .hint file exists for a file ID, rebuildindex loads it instead of scanning the data file; hint files have no checksum validation, so they must be written atomically with their data files during compaction.

Source: entries/2026/05/28/hash-index-storage-bitcask.md

bitcask-keydir-is-sole-index [IN] OBSERVATION

The in-memory keydir dict is the only index; every live key has exactly one KeyEntry pointing to its most recent non-tombstone record on disk.

Source: entries/2026/05/28/hash-index-storage-bitcask.md

bitcask-no-checksum-validation [IN] OBSERVATION

Records have no CRC or integrity checksum; the only corruption guard is the assert read_key == key in get(), which catches index/data mismatches but not bit-rot.

Source: entries/2026/05/28/hash-index-storage-bitcask.md

bitcask-no-mmap [IN] OBSERVATION

Neither Bitcask implementation uses memory-mapped I/O; both use standard open()/read() with manual buffering, forgoing kernel page cache optimizations that production Bitcask implementations rely on for startup performance

Source: entries/2026/05/29/topic-bitcask-startup-cost.md

bitcask-no-parallel-rebuild [IN] OBSERVATION

Both Bitcask implementations process data/hint files strictly sequentially during keydir rebuild with no threading, multiprocessing, or async I/O, even though hint file loading is embarrassingly parallel

Source: entries/2026/05/29/topic-bitcask-startup-cost.md

bitcask-no-range-query-support [IN] OBSERVATION

Neither Bitcask implementation has any range query, ordered iteration, or prefix scan capability; the hash-based keydir only supports exact-key point lookups

Source: entries/2026/05/29/topic-bitcask-vs-lsm-tree.md

bitcask-no-validation-on-segment-filenames [IN] OBSERVATION

If a file matching segment*.dat contains non-numeric characters in the ID position, findexistingsegments crashes with ValueError; no validation guards against malformed filenames

Source: entries/2026/05/29/log-structured-hash-table-bitcask-findexisting_segments.md

bitcask-partial-write-safe [IN] OBSERVATION

On startup recovery, the store skips incomplete records at segment tail (header present but payload truncated) without raising errors or losing previously committed data

Source: entries/2026/05/28/log-structured-hash-table-test_bitcask.md

bitcask-reader-not-threadsafe [IN] OBSERVATION

File handles in self.filehandles are shared and mutable (via seek), so concurrent calls to readrecord on the same fileid would race.

Source: entries/2026/05/29/hash-index-storage-bitcask-readrecord.md

bitcask-rebuild-sorts-ascending [IN] OBSERVATION

rebuildindex sorts file IDs ascending before scanning so that newer records overwrite older ones in the keydir, enforcing last-write-wins semantics.

Source: entries/2026/05/28/hash-index-storage-bitcask-rebuildindex.md

bitcask-record-format-symmetry [IN] OBSERVATION

readrecord and writerecord share the same binary layout (<dII header + key bytes + value bytes); a change to either must be mirrored in the other.

Source: entries/2026/05/29/hash-index-storage-bitcask-readrecord.md

bitcask-recovery-order-is-ascending [IN] OBSERVATION

Recovery scans segments in ascending ID order so that newer writes overwrite older index entries, and the highest-ID segment becomes the active segment for appending

Source: entries/2026/05/28/log-structured-hash-table-bitcask.md

bitcask-rename-is-rotation-not-safety [IN] OBSERVATION

The os.rename calls in both Bitcask implementations rename active segments to frozen segments for namespace management, not for crash-safe atomic file creation — the file is already populated at its original path before the rename

Source: entries/2026/05/29/topic-atomic-write-via-temp-rename.md

bitcask-rotation-is-soft-cap [IN] OBSERVATION

File rotation is checked before each write, so the active file can exceed maxfilesize by up to one record's worth of bytes; the limit is a soft cap, not a hard one

Source: entries/2026/05/29/hash-index-storage-bitcask-mayberotate.md

bitcask-rotation-preserves-read-handles [IN] OBSERVATION

When the active file is rotated, its read handle in file_handles is preserved so existing KeyEntry references pointing to the old file remain valid

Source: entries/2026/05/29/hash-index-storage-bitcask-mayberotate.md

bitcask-scan-short-payload-undetected [IN] OBSERVATION

In scandata_file, a truncated header safely breaks the scan loop, but a short key/value read after a valid header is not detected — the truncated data is silently used, potentially corrupting the keydir.

Source: entries/2026/05/29/hash-index-storage-bitcask-scandata_file.md

bitcask-segment-discovery-is-filesystem-based [IN] OBSERVATION

findexisting_segments uses os.listdir as the source of truth for which segments exist; there is no persistent manifest file — the filesystem is the manifest

Source: entries/2026/05/29/log-structured-hash-table-bitcask-findexisting_segments.md

bitcask-segment-naming-must-sync [IN] OBSERVATION

The filename parsing in findexistingsegments (prefix/suffix slicing to extract segment ID) must stay in sync with the format string in segment_path; no shared constant enforces this coupling

Source: entries/2026/05/29/log-structured-hash-table-bitcask-findexisting_segments.md

bitcask-sequential-file-ids [IN] OBSERVATION

File IDs are assigned by incrementing activefileid by 1 with no gap detection or collision check; compaction updates activefileid to avoid collisions after renumbering

Source: entries/2026/05/29/hash-index-storage-bitcask-mayberotate.md

bitcask-single-active-writer [IN] OBSERVATION

Exactly one data file is open for writes at any time; all other files are immutable and read-only.

Source: entries/2026/05/28/hash-index-storage-bitcask.md

bitcask-tests-disable-sync [IN] OBSERVATION

All tests in testbitcask.py pass syncwrites=False, meaning the durable fsync-per-write code path is untested.

Source: entries/2026/05/28/hash-index-storage-test_bitcask.md

bitcask-tombstone-handling-asymmetry [IN] OBSERVATION

scandatafile removes tombstoned keys from keydir during rebuild, but loadhintfile unconditionally inserts entries — it relies on compaction having already stripped tombstones before writing the hint file.

Source: entries/2026/05/28/hash-index-storage-bitcask-rebuildindex.md

bitcask-tombstone-invisible-to-get [IN] OBSERVATION

get() never encounters tombstone records because delete() removes the key from self._index; tombstone handling is solely a recovery and compaction concern

Source: entries/2026/05/29/log-structured-hash-table-bitcask-get.md

bitcask-tombstone-is-empty-string [IN] OBSERVATION

Deletion is represented by writing a record with an empty-string value; both scandatafile (checks valsize == 0) and get (checks value == "") use this convention.

Source: entries/2026/05/28/hash-index-storage-bitcask.md

bitcask-tombstone-semantics [IN] OBSERVATION

BitcaskStore.delete() writes a tombstone record (b"_BITCASKTOMBSTONE__"); get() returns None for tombstoned keys, and compact() removes both the original entry and the tombstone

Source: entries/2026/05/28/log-structured-hash-table-test_bitcask.md

bitcask-write-record-no-index-update [IN] OBSERVATION

writerecord only appends to disk and returns a byte offset; it never modifies self._index, leaving index management entirely to callers (put, delete, compact)

Source: entries/2026/05/29/log-structured-hash-table-bitcask-writerecord.md

bloom-and-counting-share-hash-function [IN] OBSERVATION

BloomFilter and CountingBloomFilter use the identical _hashes() double-hashing scheme (MD5-based, lines 9-14), producing the same bit/counter positions for identical parameters — so a BloomFilter and CountingBloomFilter with the same m and k will agree on membership queries for the same insertions.

Source: entries/2026/05/29/topic-cuckoo-filters.md

bloom-compaction-rebuild-not-patch [IN] OBSERVATION

For SSTable compaction, rebuilding a fresh BloomFilter during the merge write pass is simpler and more correct than incrementally patching a CountingBloomFilter with remove(); compaction already pays O(N) I/O so filter construction adds only constant-factor overhead

Source: entries/2026/05/28/topic-counting-bloom-compaction-interaction.md

bloom-double-hashing [IN] OBSERVATION

_hashes() uses Kirschner-Mitzenmacher double hashing from a single MD5 digest, producing k positions as (h1 + i*h2) % m without k independent hash functions

Source: entries/2026/05/28/topic-bloom-filter-integration.md

bloom-filter-deterministic-hashing [IN] OBSERVATION

The Bloom filter hash function is deterministic (not seeded with randomness), so two filters with identical parameters produce identical bit arrays for identical inputs — enabling equality comparison and reproducible serialization

Source: entries/2026/05/29/bloom-filter-testbloomfilter.md

bloom-filter-double-hashing [IN] OBSERVATION

BloomFilter generates k bit positions via Kirschner-Mitzenmacker double hashing: one MD5 digest is split into two 64-bit values h1 and h2, then positions are computed as (h1 + i*h2) % m rather than using k independent hash functions.

Source: entries/2026/05/29/bloom-filter-bloom_filter-BloomFilter.md

bloom-filter-len-counts-insertions [IN] OBSERVATION

BloomFilter._len_ counts total add() calls, not distinct items — adding the same item twice yields len() == 2, unlike CountingBloomFilter which has the same property tracked separately

Source: entries/2026/05/29/bloom-filter-testbloomfilter.md

bloom-filter-monotonic-bits [IN] OBSERVATION

Bits in BloomFilter are only ever set, never cleared; this monotonic property makes the structure append-only, enables union via bitwise OR, and prevents deletion without a counting layer.

Source: entries/2026/05/29/bloom-filter-bloom_filter-BloomFilter.md

bloom-filter-no-false-negatives [IN] OBSERVATION

After BloomFilter.add(x), x in filter always returns True; the data structure guarantees zero false negatives by design (all k bit positions are set on add and all must be set for membership).

Source: entries/2026/05/29/bloom-filter-bloom_filter-BloomFilter.md

bloom-filter-not-integrated [IN] OBSERVATION

Neither LSM tree implementation (lsm.py or sstable.py) references or uses the bloom filter module; they exist as independent DDIA concept demonstrations with zero cross-module imports

Source: entries/2026/05/28/topic-bloom-filter-integration.md

bloom-filter-optimal-sizing-textbook [IN] OBSERVATION

BloomFilter computes bitcount as ceil(-n * ln(p) / ln(2)^2) and hashcount as round((m/n) * ln(2)), matching the standard optimal formulas from Bloom filter literature

Source: entries/2026/05/29/bloom-filter-testbloomfilter.md

bloom-filter-union-count-overestimates [IN] OBSERVATION

BloomFilter.union() sets count = self.count + other.count, which overstates distinct items when filters share elements; estimatecount() from bit density is more accurate for unioned filters

Source: entries/2026/05/29/bloom-filter-bloom_filter.md

bloom-filter-union-requires-identical-params [IN] OBSERVATION

BloomFilter.union() raises ValueError if the two filters differ in bit array size (m) or hash count (k), since bitwise OR is only meaningful when bit positions map to the same hash space.

Source: entries/2026/05/29/bloom-filter-bloom_filter-BloomFilter.md

bloom-hashes-no-m-zero-guard [IN] OBSERVATION

_hashes does not guard against m=0, which causes an unhandled ZeroDivisionError in the (h1 + i * h2) % m computation

Source: entries/2026/05/29/bloom-filter-bloomfilter-hashes.md

bloom-hashes-positions-not-unique [IN] OBSERVATION

The k positions returned by _hashes are not guaranteed distinct — duplicate positions can occur when m is small relative to k, and all callers (add, contains, remove) tolerate this via idempotent bit/counter operations

Source: entries/2026/05/29/bloom-filter-bloomfilter-hashes.md

bloom-md5-split-64bit [IN] OBSERVATION

_hashes splits the 128-bit MD5 digest at the midpoint into two 64-bit little-endian unsigned integers h1 and h2, used as base position and step size for double hashing

Source: entries/2026/05/29/bloom-filter-bloomfilter-hashes.md

bloom-serialization-footer-ready [IN] OBSERVATION

BloomFilter.to_bytes packs a 12-byte header (m, k, count as three little-endian uint32s) followed by the raw bit array, matching the pattern used by SSTable footers for sparse indices

Source: entries/2026/05/28/topic-bloom-filter-integration.md

both-storage-paradigms-hit-scalability-walls [IN] OBSERVATION

Both storage paradigms in the reference implementations exhibit fundamental scalability constraints: the hash index requires all keys in RAM (making dataset size directly bound by available memory with no spill-to-disk fallback), while the LSM tree scans every SSTable on negative lookups because the correctly-implemented Bloom filter module is never wired into the read path.

both-suites-discoverable-by-pytest [IN] OBSERVATION

Both testertest*.py and test*.py naming conventions match pytest's default collection patterns (test*.py and *test*.py), so pytest in any project directory runs the combined suite — note: this contradicts existing belief tester-naming-avoids-pytest-discovery; one should be retracted

Source: entries/2026/05/29/topic-tester-test-pattern.md

broadcast-hash-join-requires-small-side-in-memory [IN] OBSERVATION

BroadcastHashJoin loads the entire small dataset into a hash table at construction time via buildhash_table, so the small side must fit in a single mapper's memory

Source: entries/2026/05/29/topic-ddia-ch10-reduce-side-joins.md

broadcast-join-loads-small-side-at-construction [IN] OBSERVATION

BroadcastHashJoin receives the small dataset at construction time and builds a hash table; the large dataset is streamed through .join(), reflecting the asymmetric API where the small side must be available upfront

Source: entries/2026/05/29/map-side-join-testmapside_joins.md

btree-allocate-page-outside-wal [IN] OBSERVATION

Page allocation during splits modifies metadata through PageManager.write_meta (not WAL-logged), creating a potential crash-safety gap if the process fails between allocation and the subsequent WAL commit

Source: entries/2026/05/28/b-tree-storage-engine-btree-_insert.md

btree-bisect-left-for-leaf-exact-match [IN] OBSERVATION

Leaf lookups use bisectleft to find the leftmost insertion point, then check keys[idx] == key for exact match; bisectright would return the position after all equal keys, overshooting the target.

Source: entries/2026/05/29/topic-bisect-right-vs-left-in-btrees.md

btree-bisect-left-leaf-bisect-right-internal [IN] OBSERVATION

search uses bisectleft at leaf nodes for exact-match lookup and bisect_right at internal nodes for child routing; this asymmetry matches the split convention where the median key is copied up and retained as the first key of the right leaf.

Source: entries/2026/05/29/b-tree-storage-engine-btree-_search.md

btree-counters-reset-per-operation [IN] OBSERVATION

pm.resetcounters() is called at the start of each public method, so stats.pagesread/written reflects only the most recent operation.

Source: entries/2026/05/28/b-tree-storage-engine-btree.md

btree-crc32-skips-corrupt-entries [IN] OBSERVATION

Corrupted WAL entries (CRC mismatch) are silently skipped during B-tree recovery rather than raising an error or halting replay.

Source: entries/2026/05/28/b-tree-storage-engine-test_btree.md

btree-delete-causes-monotonic-degradation [IN] OBSERVATION

B-tree deletion is structurally degrading: empty leaf pages are never freed, freed pages leave dangling parent pointers, and tree height only increases — causing monotonic space amplification and search-path lengthening over delete-heavy workloads.

btree-delete-cleanup-depth2-only [IN] OBSERVATION

Empty leaf cleanup only triggers when the current internal node is at depth 2 (direct parent of leaves) and the empty child is not the leftmost; empty leaves at depth > 2 or at index 0 are silently left in place

Source: entries/2026/05/28/b-tree-storage-engine-btree-_delete.md

btree-delete-is-immediate-not-tombstone [IN] OBSERVATION

Deletes decrement total_keys immediately and the key becomes unreadable at once; no tombstone entries are written and no deferred compaction pass is needed

Source: entries/2026/05/28/topic-btree-split-and-merge-strategy.md

btree-delete-leaks-pages [IN] OBSERVATION

BTree.delete removes keys from leaves but never calls freepage on empty leaves, causing unbounded page file growth despite PageManager having a free list mechanism.

Source: entries/2026/05/28/b-tree-storage-engine-fix-plan.md

btree-delete-metadata-reread [IN] OBSERVATION

delete re-reads metadata after delete returns to pick up nextfree/freehead changes made by pm.freepage, rather than threading updated values through the recursive call stack.

Source: entries/2026/05/29/b-tree-storage-engine-btree-delete.md

btree-delete-no-commit-on-miss [IN] OBSERVATION

When the key is not found, delete writes no WAL entries and performs no commit, making failed deletes purely read-only operations.

Source: entries/2026/05/29/b-tree-storage-engine-btree-delete.md

btree-delete-skips-free-page [IN] OBSERVATION

The delete path does not call freepage even when a leaf becomes completely empty; PageManager's free list is used only by allocatepage after explicit free_page calls from other paths.

Source: entries/2026/05/28/topic-b-tree-underflow-rebalancing.md

btree-delete-three-valued-return [IN] OBSERVATION

_delete returns False (not found), True (deleted, no cleanup needed), or the string 'empty' (leaf now empty, parent should attempt cleanup); the public delete() collapses this to bool

Source: entries/2026/05/28/b-tree-storage-engine-btree-_delete.md

btree-depth-param-not-validated [IN] OBSERVATION

_search trusts that the depth parameter from metadata accurately reflects the tree's balance; corrupted height causes silent wrong-format deserialization (leaf data parsed as internal or vice versa) with no error.

Source: entries/2026/05/29/b-tree-storage-engine-btree-_search.md

btree-deserialize-no-validation [IN] OBSERVATION

Both deserializeleaf and deserializeinternal perform no validation of page type, sort order, or buffer length; they trust that WAL CRC checks and the serialize functions guarantee well-formed pages, so corruption produces silent wrong results rather than errors.

Source: entries/2026/05/29/b-tree-storage-engine-btree-deserializeleaf.md

btree-double-fsync-per-mutation [IN] OBSERVATION

B-tree mutations pay for os.fsync() twice: once when writing the WAL entry (btree.py:137) and again when committing the page to the data file (btree.py:105), with the WAL truncated only after the data file sync confirms durability.

Source: entries/2026/05/29/topic-fsync-vs-fdatasync.md

btree-durability-protects-data-not-structure [IN] OBSERVATION

The B-tree's durability model protects user data but not structural integrity: mutations pay double fsync for data pages (WAL entry + data write) while structural metadata is never fsynced, AND structural integrity silently erodes during normal operation (leaked pages, ever-growing height, dangling parent pointers after free_page) with no defensive checks in the I/O layer to detect or prevent degradation.

btree-empty-leaf-freed-on-delete [IN] OBSERVATION

When all keys are removed from a non-root leaf, the page is returned to the free list via PageManager.free_page() rather than persisting as an empty node

Source: entries/2026/05/28/topic-btree-split-and-merge-strategy.md

btree-file-never-shrinks [IN] OBSERVATION

The B-tree data file only grows (via nextfree bump in allocatepage) and never shrinks; no truncate, compact, or defrag operation exists anywhere in the module.

Source: entries/2026/05/29/topic-free-list-fragmentation.md

btree-fix-plan-all-crash-safety [IN] OBSERVATION

All six bugs documented in fix-plan.md concern single-writer WAL/fsync crash safety; none involve concurrent access races, illustrating that even without concurrency the WAL protocol has subtle correctness requirements

Source: entries/2026/05/29/topic-latch-coupling-postgres.md

btree-free-list-intrusive [IN] OBSERVATION

Freed pages store the "next free" pointer in their own body at HEADER_SIZE offset, forming an intrusive singly-linked list with no separate allocator data structure.

Source: entries/2026/05/28/b-tree-storage-engine-btree.md

btree-free-page-bypasses-wal [IN] OBSERVATION

pm.free_page() called during deletion writes directly to the data file without WAL logging, creating a crash-consistency gap if the process fails between the free and the WAL commit

Source: entries/2026/05/28/b-tree-storage-engine-btree-_delete.md

btree-free-page-no-parent-fixup [IN] OBSERVATION

PageManager.free_page adds a page to the free list without removing the parent's downlink to it, creating a dangling-pointer risk on crash

Source: entries/2026/05/29/topic-postgres-nbtree-half-dead-state-machine.md

btree-has-page-io-counters [IN] OBSERVATION

The B-tree's PageManager tracks pagesread and pageswritten with a resetcounters() method (multiply by pagesize for bytes), but WAL writes are not counted by any existing counter

Source: entries/2026/05/29/topic-write-amplification-measurement.md

btree-height-never-shrinks [IN] OBSERVATION

Since delete never merges internal nodes or demotes the root, tree height only increases (on splits) and never decreases, regardless of how many keys are deleted.

Source: entries/2026/05/28/topic-b-tree-underflow-rebalancing.md

btree-impl-asymmetric-split-vs-merge [IN] OBSERVATION

The reference implementation fully propagates splits upward through insert's return value but swallows the 'empty' signal in delete after one level of cleanup, making insertion structurally complete but deletion structurally incomplete

Source: entries/2026/05/28/topic-postgres-nbtree-lazy-deletion.md

btree-insert-no-wal-commit [IN] OBSERVATION

_insert never calls wal.commit(); the caller (put) is solely responsible for committing after all page writes and metadata updates complete, making the entire put atomic with respect to crash recovery

Source: entries/2026/05/28/b-tree-storage-engine-btree-_insert.md

btree-internal-split-promotes-key [IN] OBSERVATION

Internal node splits promote the midpoint key into the parent (removing it from the child), unlike leaf splits which copy it (the key remains in the right child); this asymmetry allows uniform bisect_right routing at all internal levels.

Source: entries/2026/05/29/topic-bisect-right-vs-left-in-btrees.md

btree-keys-decoded-as-utf8 [IN] OBSERVATION

Leaf deserialization always decodes keys as UTF-8 strings while values remain raw bytes; invalid UTF-8 in key bytes causes UnicodeDecodeError at read time, not write time (since serialize accepts both str and bytes).

Source: entries/2026/05/29/b-tree-storage-engine-btree-deserializeleaf.md

btree-leaf-sibling-chain [IN] OBSERVATION

Leaves store a nextsibling pointer forming a left-to-right chain terminated by NOSIBLING (0xFFFFFFFF), enabling range scans without touching internal nodes.

Source: entries/2026/05/28/b-tree-storage-engine-btree.md

btree-leaf-sibling-chain-forward-only [IN] OBSERVATION

Leaf pages store a nextsibling forward pointer (NOSIBLING = 0xFFFFFFFF sentinel) but no backward pointer, so unlinking a leaf from the sibling chain requires locating the predecessor via parent traversal rather than direct backlink.

Source: entries/2026/05/29/topic-postgres-nbtree-vacuum.md

btree-leaf-sibling-patched-on-remove [IN] OBSERVATION

When an empty leaf is removed during deletion, delete patches the previous sibling's nextsibling pointer to splice the empty leaf out of the linked list used by rangescan and iter_

Source: entries/2026/05/28/b-tree-storage-engine-btree-_delete.md

btree-leaf-split-copies-key [IN] OBSERVATION

Leaf splits copy the midpoint key into both the parent and the right leaf (key remains searchable in the leaf), while internal splits promote the midpoint key out of both children into the parent only

Source: entries/2026/05/28/b-tree-storage-engine-btree-_insert.md

btree-lookup-reads-bounded-by-height [IN] OBSERVATION

A single BTree.get() reads at most height + 1 disk pages, verified via pm.pages_read after counter reset.

Source: entries/2026/05/28/b-tree-storage-engine-test_btree.md

btree-max-keys-forces-splits [IN] OBSERVATION

With maxkeysper_page=4, inserting a 5th key into a leaf triggers a page split; tests use this small fanout to exercise multi-level tree structure with few keys.

Source: entries/2026/05/28/b-tree-storage-engine-test_btree.md

btree-metadata-bypasses-wal [IN] OBSERVATION

BTree.writemeta writes root pointer and free list head directly to the data file without logging through the WAL, making metadata updates not crash-safe.

Source: entries/2026/05/28/b-tree-storage-engine-fix-plan.md

btree-mutation-fsync-is-asymmetric [IN] OBSERVATION

B-tree mutations pay double fsync costs for user data (WAL entry + data page) but skip fsync entirely for structural metadata, creating an asymmetry where key-value pairs survive crashes but the free-page list and allocation state may not.

btree-no-underflow-rebalancing [IN] OBSERVATION

Delete does not rebalance underfilled nodes; the only cleanup is freeing empty non-leftmost leaves at depth 2 — a deliberate simplification over production B-trees.

Source: entries/2026/05/28/b-tree-storage-engine-btree.md

btree-page-alignment-isolates-corruption [IN] OBSERVATION

The B-tree's fixed-size pages (pagenum * pagesize addressing) provide natural resync boundaries for the data file — corruption of one page does not affect reads of other pages, unlike the streaming WAL formats

Source: entries/2026/05/29/topic-length-prefix-framing-resilience.md

btree-page-io-has-no-defensive-checks [IN] OBSERVATION

The B-tree's page I/O layer lacks defensive validation at every stage: deserialization accepts any bytes, leaf-finding doesn't verify node types, and page writes silently truncate oversized data — a single corrupted byte can cascade through the tree undetected.

btree-page-overwrite-no-size-change [IN] OBSERVATION

B-tree PageManager overwrites pages at fixed offsets within a pre-allocated file, so most write+sync cycles do not change file size and would benefit from fdatasync skipping metadata I/O

Source: entries/2026/05/29/topic-fdatasync-vs-fsync-tradeoffs.md

btree-page-writes-unbuffered-by-app [IN] OBSERVATION

PageManager has no application-level page cache — every readpage reads from the file and every writepage writes immediately, relying entirely on the kernel page cache for performance

Source: entries/2026/05/29/topic-o-direct-and-dio.md

btree-page-zero-is-metadata [IN] OBSERVATION

Page 0 is always the metadata page (rootpage, height, totalkeys, nextfreepage, freelisthead); the root page is never page 0.

Source: entries/2026/05/28/b-tree-storage-engine-btree.md

btree-partial-split-replay-orphans-right-page [IN] OBSERVATION

If a crash occurs after the right page and left page are WAL-logged but before the parent update, recovery produces a consistent sibling chain but the right page is unreachable from tree descent — range scans find the keys but point lookups silently miss them

Source: entries/2026/05/29/topic-split-and-sibling-update-atomicity.md

btree-put-is-upsert [IN] OBSERVATION

BTree.put() on an existing key updates the value in-place without incrementing len(tree) — it is an upsert, not a blind insert.

Source: entries/2026/05/28/b-tree-storage-engine-testertestbtree.md

btree-put-metadata-reread [IN] OBSERVATION

put must re-read metadata after insert returns because PageManager.allocatepage mutates nextfree and freehead as a side effect during splits, and those changes are not threaded back through the return value.

Source: entries/2026/05/29/b-tree-storage-engine-btree-put.md

btree-recursive-insert-returns-union [IN] OBSERVATION

insert returns a discriminated union: None (updated existing), 'inserted' (new key, no split), or (midkey, new_page) (split happened); root splits are handled in put(), not inside the recursion.

Source: entries/2026/05/28/b-tree-storage-engine-btree.md

btree-sibling-chain-fixed-offset [IN] OBSERVATION

The next_sibling field sits at a fixed byte offset (bytes 3–6) in every serialized leaf page, making the sibling chain walkable without fully deserializing page contents

Source: entries/2026/05/29/topic-split-during-insert.md

btree-sibling-chain-maintained-across-mutations [IN] OBSERVATION

The B-tree sibling chain is actively maintained during all structural mutations with a fixed wire format: splits write the new right page with the old leaf's nextsibling pointer before rewriting the old leaf, deletions patch the previous sibling's pointer to splice out the removed leaf, and the nextsibling field sits at a fixed byte offset (bytes 3-6) enabling reliable traversal without full page deserialization.

btree-sibling-pointers-unused-for-merge [IN] OBSERVATION

Leaf pages serialize a nextsibling pointer (via serializeleaf and NOSIBLING sentinel) providing infrastructure for merge/redistribute, but no code path reads sibling pointers during deletion.

Source: entries/2026/05/28/topic-b-tree-underflow-rebalancing.md

btree-single-file-avoids-dir-fsync-gap [IN] OBSERVATION

The B-tree's PageManager writes to a single pre-existing data file opened at construction, so normal operations never create new files and avoid the directory fsync gap that affects segment-based engines during rotation and compaction.

Source: entries/2026/05/29/topic-fsync-guarantees.md

btree-single-threaded-assumption [IN] OBSERVATION

The B-tree implementation has no locking, latching, or concurrency control; all crash-safety gaps exist even without concurrent access

Source: entries/2026/05/29/topic-postgres-nbtree-half-dead-state-machine.md

btree-single-writer [IN] OBSERVATION

put (and delete) assume single-threaded access: metadata is re-read after insert/delete without any locking or compare-and-swap, creating a TOCTOU window under concurrent writers.

Source: entries/2026/05/29/b-tree-storage-engine-btree-put.md

btree-split-write-order-protects-chain [IN] OBSERVATION

During a leaf split, _insert writes the new right page to the WAL before the modified left page, ensuring the sibling pointer target is durable before any pointer references it — so the chain never points to a nonexistent page after crash recovery

Source: entries/2026/05/29/topic-split-and-sibling-update-atomicity.md

btree-stdlib-only [IN] OBSERVATION

The B-tree module uses only stdlib imports (os, struct, zlib, bisect) with zero external dependencies.

Source: entries/2026/05/28/b-tree-storage-engine-btree.md

btree-structural-integrity-silently-erodes [IN] OBSERVATION

B-tree structural integrity silently erodes over time: deletions cause monotonic degradation (leaked pages, ever-growing height) while the I/O layer performs no validation to detect or report the resulting inconsistencies.

btree-survives-close-reopen [IN] OBSERVATION

Data written before BTree.close() is fully recoverable by constructing a new BTree on the same directory, including updates and deletes.

Source: entries/2026/05/28/b-tree-storage-engine-testertestbtree.md

btree-traversal-is-forward-only [IN] OBSERVATION

B-tree sequential access is strictly forward-only: iteration descends the left spine (following children at index zero at every internal level) to reach the leftmost leaf, then walks the forward-only sibling chain, with no backward pointer, reverse iterator, or random leaf access mechanism.

btree-uses-bisect-not-linear-scan [IN] OBSERVATION

In-node key lookup uses bisectleft/bisectright (binary search), making within-node search O(log B) rather than the textbook O(B) linear scan — shifting the optimal branching factor higher than classical B-tree analysis predicts.

Source: entries/2026/05/29/topic-index-size-vs-scan-cost-benchmarking.md

btree-wal-before-data [IN] OBSERVATION

Every page mutation is logged to the WAL (with fsync) before being written to the data file; recovery replays the WAL, so no committed write is lost on crash.

Source: entries/2026/05/28/b-tree-storage-engine-btree.md

btree-wal-checksum-excludes-page-num [IN] OBSERVATION

In b-tree-storage-engine/btree.py, the WAL entry checksum covers only pagedata, so a corrupted pagenum during recovery writes a valid page to the wrong disk location without detection — the most dangerous integrity gap in the codebase

Source: entries/2026/05/29/topic-crc-coverage-audit.md

btree-wal-entry-includes-full-page [IN] OBSERVATION

Each B-tree WAL entry is 16 + pagesize bytes (12-byte header with seq/pagenum/data_len, full page image, 4-byte CRC32), making WAL writes the dominant cost for small key-value updates

Source: entries/2026/05/29/topic-write-amplification-measurement.md

btree-wal-fsync-per-entry [IN] OBSERVATION

Each WAL.log_write() call performs an fsync, guaranteeing that WAL entries are durable before any data page writes proceed

Source: entries/2026/05/29/topic-split-atomicity-via-wal.md

btree-wal-is-physical-redo-only [IN] OBSERVATION

The B-tree WAL in btree.py stores full after-image page data (physical WAL) and supports only redo recovery; there is no undo capability

Source: entries/2026/05/29/topic-write-ahead-logging-fsync-ordering.md

btree-wal-no-data-fsync [IN] OBSERVATION

PageManager.write_page calls flush() but never os.fsync(), so written pages may not be durable on crash even after the WAL has committed.

Source: entries/2026/05/28/b-tree-storage-engine-fix-plan.md

btree-wal-no-operation-boundaries [IN] OBSERVATION

The WAL logs individual page writes and commits by full truncation; it cannot represent a multi-step structural operation as an atomic unit

Source: entries/2026/05/29/topic-postgres-nbtree-half-dead-state-machine.md

btree-wal-provides-split-atomicity [IN] OBSERVATION

Multi-page operations (splits, deletes that free pages) are made crash-safe by writing all page modifications to the WAL before applying them to the data file

Source: entries/2026/05/28/topic-btree-split-and-merge-strategy.md

btree-wal-replay-is-idempotent [IN] OBSERVATION

Because the B-tree WAL stores complete page images, replaying entries already applied to the data file produces identical results, making the crash window after sync() but before truncate() harmless

Source: entries/2026/05/29/topic-write-ahead-logging-fsync-ordering.md

btree-wal-replays-without-commit [IN] OBSERVATION

The B-tree's WAL replays all valid (CRC-passing) entries on recovery regardless of whether a commit marker is present — uncommitted entries are applied without distinction.

Source: entries/2026/05/28/b-tree-storage-engine-test_btree.md

btree-wal-three-phase-commit [IN] OBSERVATION

The B-tree WAL commit follows a strict three-phase protocol: (1) write WAL entries + fsync, (2) fsync data file via page_manager.sync(), (3) truncate WAL + fsync WAL fd — crash-safe at every interleaving because WAL entries are physical page images making replay idempotent.

Source: entries/2026/05/29/topic-crash-recovery-invariants.md

btree-wal-truncate-before-data-sync [IN] OBSERVATION

WAL.commit() truncates the WAL before the data file is fsynced, creating a crash window where both the log and the data can be lost.

Source: entries/2026/05/28/b-tree-storage-engine-fix-plan.md

btree-wal-truncate-is-safe [IN] OBSERVATION

The B-tree's WAL commit() only truncates after fsyncing the data file, making truncation idempotent — a crash mid-truncation simply replays the WAL entries again on recovery

Source: entries/2026/05/28/topic-truncation-crash-safety.md

btree-wal-truncation-is-commit-point [IN] OBSERVATION

The WAL truncation in commit() is the atomic linearization point: before truncation, recovery will redo the transaction; after truncation, the transaction is committed

Source: entries/2026/05/29/topic-split-atomicity-via-wal.md

btree-wal-uses-full-page-images [IN] OBSERVATION

The WAL logs complete page images (physical logging) rather than logical operations, which eliminates incomplete-split states during recovery but increases write amplification compared to PostgreSQL's logical WAL entries.

Source: entries/2026/05/29/topic-postgres-nbtree-readme.md

btree-wal-uses-trailer-crc [IN] OBSERVATION

The B-tree WAL places its CRC32 checksum as a trailer after page_data, while the standalone WAL and Bitcask embed CRC in the record header — a structural difference that affects torn-write detection behavior

Source: entries/2026/05/29/topic-crc-scope-comparison-across-implementations.md

buffer-bounded-by-window-plus-lateness [IN] OBSERVATION

Join buffer size is bounded by events within window.duration of the current watermark; expireevents garbage-collects everything below that cutoff on every event arrival or advance_time call

Source: entries/2026/05/29/topic-watermark-vs-processing-time.md

bully-assumes-full-connectivity [IN] OBSERVATION

The Bully Algorithm requires any node to message any other node directly via allnodeids; there is no ring or tree overlay topology

Source: entries/2026/05/29/topic-bully-vs-ring-election.md

bully-candidate-half-timeout [IN] OBSERVATION

Candidates wait electiontimeout // 2 for ALIVE responses before deciding, shorter than the follower's full electiontimeout, to avoid cascading election delays.

Source: entries/2026/05/29/leader-election-leader_election.md

bully-cascade-causes-quadratic-messages [IN] OBSERVATION

Receiving an ELECTION from a lower-ID node triggers both an ALIVE response and a new start_election call, creating the O(n²) message cascade in worst-case elections

Source: entries/2026/05/29/topic-bully-vs-ring-election.md

bully-coordinator-is-broadcast [IN] OBSERVATION

declare_victory sends a COORDINATOR message to every other node in the cluster (O(n) messages), not just to nodes that participated in the election

Source: entries/2026/05/29/topic-bully-vs-ring-election.md

bully-election-terms-monotonic [IN] OBSERVATION

Election terms in getelectionhistory() are enforced to be monotonically non-decreasing across successive elections, validated by testtermsincrease_monotonically

Source: entries/2026/05/29/leader-election-testleaderelection.md

bully-highest-id-wins [IN] OBSERVATION

The Bully Algorithm always elects the highest available node ID as leader; startelection only sends ELECTION to higher-ID nodes, and declarevictory only fires when no higher node responds.

Source: entries/2026/05/29/leader-election-leader_election.md

bully-recovery-triggers-election [IN] OBSERVATION

When a failed node recovers via recover_node, it immediately starts an election, potentially preempting the current leader — this is the defining "bully" behavior.

Source: entries/2026/05/29/leader-election-leader_election.md

bully-split-brain-resolved-post-hoc [IN] OBSERVATION

Split-brain is detected and resolved by resolvesplit_brain in the cluster harness after each tick (forcing lower-ID leaders to re-elect), not prevented by the node-level protocol alone.

Source: entries/2026/05/29/leader-election-leader_election.md

bully-synchronous-delivery [IN] OBSERVATION

All messages generated within a single tick() call are delivered and fully resolved (including cascading responses) in a while all_messages loop before the tick returns.

Source: entries/2026/05/29/leader-election-leader_election.md

bully-tests-use-deterministic-simulation [IN] OBSERVATION

Leader election tests control time via an explicit starttime parameter to rununtil_leader() rather than real timers or sleeps, making the entire test suite deterministic with no flaky timing

Source: entries/2026/05/29/leader-election-testleaderelection.md

byzantine-mode-is-immutable-after-construction [IN] OBSERVATION

The byzantinemode field is set in init_ and never modified during protocol execution, so a node's fault behavior cannot change mid-protocol — simplifying reasoning about safety guarantees

Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-applybyzantine.md

catch-up-then-subscribe [IN] OBSERVATION

LiveProjection implements the catch-up subscription pattern: replay historical events via readall(fromposition=...), then switch to push-based updates for future events, bridging pull/push at a known position with no gap

Source: entries/2026/05/29/topic-live-projection-subscription-mechanism.md

catch-up-uses-event-id-filtering [IN] OBSERVATION

catchup() calls readall(fromposition=self.position + 1) to skip already-processed events, but onevent has no equivalent position guard — an asymmetry that enables duplicate processing

Source: entries/2026/05/29/topic-catch-up-subscription-gap.md

catch-up-uses-position-plus-one [IN] OBSERVATION

Projection.catchup() calls readall(fromposition=self.position + 1), so loading a snapshot sets the exact boundary — only events after the snapshot's position are replayed, making reconstruction cost proportional to events-since-snapshot

Source: entries/2026/05/29/topic-event-sourcing-snapshots.md

cbf-8x-memory-vs-standard [IN] OBSERVATION

CountingBloomFilter uses one byte per counter position (bytearray(self._m)) versus one bit per position in BloomFilter, an 8x memory overhead for deletion support

Source: entries/2026/05/29/topic-memtable-bloom-filter-lifecycle.md

cbf-count-tracks-calls-not-cardinality [IN] OBSERVATION

CountingBloomFilter._len_ returns add calls minus remove calls, not distinct element count; adding the same item twice and removing once leaves len == 1

Source: entries/2026/05/29/bloom-filter-bloom_filter-CountingBloomFilter.md

cbf-one-byte-per-counter [IN] OBSERVATION

Each CountingBloomFilter counter uses a full byte in a bytearray regardless of counter_bits, using 8x more memory than a bit-packed representation

Source: entries/2026/05/29/bloom-filter-bloom_filter-CountingBloomFilter.md

cbf-remove-can-introduce-false-negatives [IN] OBSERVATION

Unlike standard Bloom filters which never have false negatives, CountingBloomFilter.remove() can cause false negatives when two items share a hash position — removing one decrements the shared counter, potentially making the other appear absent

Source: entries/2026/05/29/bloom-filter-bloom_filter-remove.md

cbf-remove-false-accept [IN] OBSERVATION

CountingBloomFilter.remove can silently succeed for an item never added if all its hash positions have non-zero counters from other items, corrupting the filter state

Source: entries/2026/05/29/bloom-filter-bloom_filter-CountingBloomFilter.md

cbf-remove-raises-on-absent-item [IN] OBSERVATION

CountingBloomFilter.remove() raises ValueError if any hash position has a zero counter, providing a safety check against removing items that were never added

Source: entries/2026/05/29/topic-memtable-bloom-filter-lifecycle.md

cbf-remove-two-pass-atomicity [IN] OBSERVATION

CountingBloomFilter.remove() validates all hash positions have non-zero counters in a first pass before decrementing any in a second pass, preventing partial state corruption when removing an absent item

Source: entries/2026/05/29/bloom-filter-bloom_filter-remove.md

cbf-saturated-counters-are-permanent [IN] OBSERVATION

When a CountingBloomFilter counter reaches maxval (default 15), it is never decremented, preventing false negatives at the cost of a slight false positive rate increase

Source: entries/2026/05/29/topic-memtable-bloom-filter-lifecycle.md

cbf-saturated-counters-never-decrement [IN] OBSERVATION

When a CountingBloomFilter counter reaches maxval (default 15 for 4-bit counters), it is permanently frozen: neither add nor remove will change it

Source: entries/2026/05/29/bloom-filter-bloom_filter-CountingBloomFilter.md

cdc-and-event-sourcing-share-projection-pattern [IN] OBSERVATION

CDCConsumer/MaterializedView in cdc.py and Projection in event_store.py implement the same structural pattern — track position, pull new events, apply handlers — against different event sources (database mutations vs. domain events)

Source: entries/2026/05/29/topic-cqrs-read-models.md

cdc-backbone-both-heuristic-and-insufficient [IN] OBSERVATION

The CDC backbone that all derived systems depend on has two independent reliability gaps: event type semantics are determined by reconstruction heuristics rather than explicit markers (insert vs update distinguished by old_value presence, tombstones reported as None, snapshots use sentinel sequence numbers), AND the consistency requirements of derived systems (explicit flush to make mutations visible, old values for index maintenance) aren't reliably met by the CDC infrastructure.

cdc-before-after-contract [IN] OBSERVATION

Every ChangeEvent follows a strict before/after convention: INSERT has before=None, DELETE has after=None, UPDATE has both populated as full row copies, making events self-contained without querying the source database.

Source: entries/2026/05/29/change-data-capture-cdc.md

cdc-consumer-poll-is-at-least-once [IN] OBSERVATION

If a CDC consumer crashes mid-poll(), the position is not advanced, so all events from that batch will be reprocessed on the next call — providing at-least-once delivery with no deduplication.

Source: entries/2026/05/29/change-data-capture-cdc.md

cdc-consumer-position-is-next-sequence [IN] OBSERVATION

A CDC consumer's position tracks the *next* sequence number to read (not the last one read); after processing, it advances to lastevent.sequence_number + 1.

Source: entries/2026/05/29/change-data-capture-cdc.md

cdc-consumer-position-is-volatile [IN] OBSERVATION

CDCConsumer tracks its read position as an in-memory integer (_position) with no durable offset storage, meaning position is lost on process restart

Source: entries/2026/05/29/topic-cdc-flush-semantics.md

cdc-determines-insert-vs-update-by-old-value [IN] OBSERVATION

The CDC layer distinguishes insert from update events by checking whether old_value is None; the WAL only knows PUT and DELETE, so the semantic enrichment happens at the CDC boundary

Source: entries/2026/05/29/topic-ddia-unbundling-concept.md

cdc-event-semantics-depend-on-reconstruction-heuristics [IN] OBSERVATION

CDC event semantics are determined by reconstruction heuristics rather than explicit markers: insert vs update is distinguished by checking whether old_value is None, tombstones are reported as None values indistinguishable from missing data in conflict records, and snapshots use a sentinel sequence number (-1) to distinguish bootstrapped state from real events.

cdc-insert-update-distinction [IN] OBSERVATION

Insert vs. update is determined solely by whether CDCStream.emit() receives a non-None old_value; the WAL itself only records PUT operations for both cases

Source: entries/2026/05/29/unbundled-database-unbundled_database.md

cdc-log-compact-keeps-latest-per-key [IN] OBSERVATION

CDCLog.compact() retains only the most recent event per (table, key) pair, matching Kafka log compaction semantics but without concurrent-reader safety

Source: entries/2026/05/29/topic-cdc-flush-semantics.md

cdc-log-sequence-numbers-survive-compaction [IN] OBSERVATION

CDCLog.compact() preserves the original sequence numbers of surviving events; it does not renumber them, so consumer positions remain valid references after compaction.

Source: entries/2026/05/29/change-data-capture-cdc.md

cdc-materialized-view-transform-none-deletes [IN] OBSERVATION

When a MaterializedView's transform function returns None for an INSERT or UPDATE event, the row is removed from the view rather than stored, acting as a filter on the derived table.

Source: entries/2026/05/29/change-data-capture-cdc.md

cdc-old-value-required-for-index-consistency [IN] OBSERVATION

SecondaryIndex.processevent depends on CDCEvent.oldvalue to remove stale index entries during updates; without before-images, incremental index maintenance produces phantom references

Source: entries/2026/05/29/topic-ddia-unbundling-concept.md

cdc-search-index-full-reindex-on-update [IN] OBSERVATION

SearchIndex removes all old tokens and re-adds all new tokens on every UPDATE event, even if only non-indexed columns changed — correct but not optimized for partial changes.

Source: entries/2026/05/29/change-data-capture-cdc.md

cdc-snapshot-uses-sentinel-sequence-number [IN] OBSERVATION

createsnapshot assigns sequencenumber=-1 to all synthetic INSERT events, distinguishing bootstrapped state from real log entries and enabling new consumers to load current state then seek to the live tail.

Source: entries/2026/05/29/change-data-capture-cdc.md

cdc-write-and-log-are-synchronously-coupled [IN] OBSERVATION

Every CDCDatabase mutation (insert/update/delete) writes to in-memory state and appends to CDCLog in the same synchronous call with no failure boundary between them

Source: entries/2026/05/29/topic-cdc-flush-semantics.md

ch-md5-masked-to-32-bits [IN] OBSERVATION

The consistent hash ring's hash function truncates MD5's 128-bit output to 32 bits via & 0xFFFFFFFF, matching the RINGSIZE of 2^32

Source: entries/2026/05/29/consistent-hashing-consistent_hashing.md

ch-preference-list-skips-duplicates [IN] OBSERVATION

getnodes walks clockwise past virtual nodes belonging to already-seen physical nodes, ensuring exactly replicationfactor distinct physical nodes in the preference list

Source: entries/2026/05/29/consistent-hashing-consistent_hashing.md

ch-ring-sorted-invariant [IN] OBSERVATION

ConsistentHashRing.ringpositions is maintained in sorted order at all times via bisect; all key lookups depend on this invariant

Source: entries/2026/05/29/consistent-hashing-consistent_hashing.md

ch-transfer-maps-are-descriptive [IN] OBSERVATION

addnode and removenode return transfer map descriptions of which arcs move between nodes but do not move data themselves; the caller must act on the map

Source: entries/2026/05/29/consistent-hashing-consistent_hashing.md

ch-vnode-count-is-weighted [IN] OBSERVATION

A consistent hash ring node with weight w gets int(num_vnodes * w) virtual nodes, so weight=0.5 produces half the default ring presence

Source: entries/2026/05/29/consistent-hashing-consistent_hashing.md

chained-checksums-trade-random-access-for-log-integrity [IN] OBSERVATION

Chained frame checksums (as in SQLite's WAL) prevent mid-log splicing, deletion, and reordering but require sequential validation from the chain start, eliminating the ability to validate or replay a single record in isolation

Source: entries/2026/05/29/topic-cumulative-checksums.md

checkpoint-empty-payload [IN] OBSERVATION

Checkpoint records are encoded with zero-length key and value fields; the record's semantic role is carried entirely by its OP_CHECKPOINT type and sequence number

Source: entries/2026/05/29/write-ahead-log-wal-checkpoint.md

checkpoint-filtered-from-replay [IN] OBSERVATION

replay() filters out CHECKPOINT records (along with COMMIT records), returning only PUT and DELETE — checkpoint records serve as truncation boundaries, not data operations visible to recovery consumers

Source: entries/2026/05/29/write-ahead-log-wal-checkpoint.md

checkpoint-record-not-used-in-recovery [IN] OBSERVATION

OPCHECKPOINT records are written to the WAL but recovery (recoverseqnum) does not search for them; the caller must supply the checkpoint sequence number externally via replay(after_seq=)

Source: entries/2026/05/29/topic-log-compaction-vs-persistence.md

checkpoint-record-unused-for-truncation [IN] OBSERVATION

The WAL supports OP_CHECKPOINT records that could serve as a durable truncation watermark (skip records below the watermark during replay), but truncate() physically removes records instead of using this mechanism

Source: entries/2026/05/29/topic-crash-safety-of-truncate.md

checksum-mask-is-python2-compat [IN] OBSERVATION

The & 0xFFFFFFFF mask in WAL._checksum is a Python 2 portability idiom — Python 3's zlib.crc32 already returns unsigned values, making the mask technically redundant but kept as defensive code

Source: entries/2026/05/29/b-tree-storage-engine-btree-_checksum.md

client-holds-stale-token-after-lock-expiry [IN] OBSERVATION

The Client.heldtokens dict is never cleared on lock expiry — the client retains and may use a token whose corresponding lock has already been acquired by another client

Source: entries/2026/05/29/topic-ddia-ch8-process-pauses.md

code-expert-has-no-codegen [IN] OBSERVATION

The code-expert workflow is read-only against the target repository — it scans, explores, and extracts beliefs but never generates or modifies source files

Source: entries/2026/05/29/topic-code-expert-generation-pipeline.md

codebase-uses-no-steal-policy [IN] OBSERVATION

All implementations use NO-STEAL buffer management: uncommitted transaction data is never written to the persistent data store, eliminating the need for undo logging at the cost of requiring all dirty data from active transactions to fit in memory

Source: entries/2026/05/29/topic-undo-logging-and-steal-policy.md

combiner-does-not-affect-map-output-stats [IN] OBSERVATION

mapoutputrecords is incremented before the combiner runs (line 108), so stats reflect pre-combiner volume; the test at line 68 asserts this explicitly

Source: entries/2026/05/29/topic-combiner-correctness.md

combiner-not-type-checked [IN] OBSERVATION

The combiner's callable signature matches the reducer's, but no runtime or static check verifies associativity or commutativity — a non-associative combiner (e.g., average) produces silently wrong results that vary with num_mappers

Source: entries/2026/05/29/topic-combiner-correctness.md

combiner-output-indistinguishable-from-mapper-output [IN] OBSERVATION

The reducer reads combiner output from intermediate JSON files identically to raw mapper output, with no metadata distinguishing pre-aggregated values from raw emitted pairs

Source: entries/2026/05/29/topic-combiner-correctness.md

commit-aborts-on-conflict [IN] OBSERVATION

On write-write conflict, commit() calls abort(tx) internally and returns False, so callers never need to manually abort after a failed commit; the transaction is dead after a False return.

Source: entries/2026/05/29/snapshot-isolation-mvcc_database-commit.md

commit-conflict-check-not-atomic [IN] OBSERVATION

The conflict-check-then-commit sequence in MVCCDatabase.commit() is not atomic — no locking protects the window between checking versions and marking committed, requiring single-threaded execution for correctness.

Source: entries/2026/05/29/snapshot-isolation-mvcc_database-commit.md

commit-conflict-scans-all-versions [IN] OBSERVATION

The write-write conflict check in commit() scans all versions of each key in the transaction's write set (not just the latest), making conflict detection O(versions) per written key.

Source: entries/2026/05/29/snapshot-isolation-mvcc_database-commit.md

commit-only-in-batch [IN] OBSERVATION

OPCOMMIT is written exclusively by appendbatch() in write-ahead-log/wal.py; single append() calls never produce a COMMIT record, so individual operations are their own atomic unit

Source: entries/2026/05/29/topic-checkpoint-vs-commit-semantics.md

commit-read-only-skip-conflict-check [IN] OBSERVATION

Read-only transactions bypass write-write conflict detection entirely in commit() and always commit successfully, since they cannot produce write-write conflicts.

Source: entries/2026/05/29/snapshot-isolation-mvcc_database-commit.md

commit-timestamp-unused-by-visibility [IN] OBSERVATION

Commit timestamps are assigned by commit() and stored in committimestamps, but isvisible never references them — visibility is determined purely by transaction IDs and the committed/active-at-start sets.

Source: entries/2026/05/29/snapshot-isolation-mvcc_database-commit.md

compact-closes-handles-before-keydir-fully-updated [IN] OBSERVATION

In hash-index-storage/bitcask.py, cached file readers for old immutable files are closed at the start of the merge-write phase, before all keydir entries have been updated to point to new locations, creating a window where reads would fail even without concurrent access

Source: entries/2026/05/29/topic-concurrent-merge-safety.md

compact-deletes-old-sstable-files [IN] OBSERVATION

compact() removes old SSTable files from disk after merging, meaning any concurrent reader holding a reference to a deleted SSTable reads stale data (on Unix) or crashes (on Windows)

Source: entries/2026/05/28/topic-lsm-concurrency-safety.md

compact-filters-against-live-index [IN] OBSERVATION

During log-structured-hash-table compaction, records in frozen segments are only kept if _index still points to that exact (path, offset), correctly excluding keys that were overwritten in the active segment after the frozen segment was closed.

Source: entries/2026/05/28/log-structured-hash-table-bitcask-compact.md

compact-newest-wins-by-seq [IN] OBSERVATION

During compaction, the entry with the highest SSTable sequence number wins for each key; sequence numbers are per-SSTable (all entries in one SSTable share the same seq), not per-entry

Source: entries/2026/05/28/log-structured-merge-tree-lsm-compact.md

compact-no-concurrency-safety [IN] OBSERVATION

compact() mutates self.sstables without locking; concurrent flush or get calls during compaction can produce incorrect state or lose newly flushed SSTables

Source: entries/2026/05/28/log-structured-merge-tree-lsm-compact.md

compact-not-crash-safe [IN] OBSERVATION

Log-structured-hash-table compaction is non-atomic: the sequence of write-new, update-index, delete-old, rename-active has no transaction boundary, and a crash mid-compaction can leave orphaned segments or a misnamed active file.

Source: entries/2026/05/28/log-structured-hash-table-bitcask-compact.md

compact-purges-tombstones [IN] OBSERVATION

Tombstones (b"") are permanently removed during compaction and never written to the output SSTable; deleted keys disappear entirely after merge

Source: entries/2026/05/28/log-structured-merge-tree-lsm-compact.md

compact-sets-base-offset-absolutely [IN] OBSERVATION

Topic.compactpartition overwrites base_offsets[partition] with the first surviving message's offset (absolute assignment), discarding whatever the previous base was, because compaction removes messages from arbitrary positions.

Source: entries/2026/05/29/topic-log-compaction-vs-retention.md

compact-single-threaded-assumption [IN] OBSERVATION

Log-structured-hash-table compaction has no synchronization protecting the index, file handles, or segment counter during its multi-step mutation sequence; concurrent access during compaction would corrupt state.

Source: entries/2026/05/28/log-structured-hash-table-bitcask-compact.md

compact-skips-crc-validation [IN] OBSERVATION

Log-structured-hash-table compaction reads records from frozen segments without verifying CRC checksums, unlike scansegment which stops at the first CRC mismatch; a corrupted record could be silently copied into the compacted segment.

Source: entries/2026/05/28/log-structured-hash-table-bitcask-compact.md

compaction-buffers-all-entries [IN] OBSERVATION

lsm.py:compact collects all merged entries into an in-memory list before writing the output SSTable, requiring O(n) memory proportional to total data rather than O(k) proportional to number of input SSTables

Source: entries/2026/05/28/topic-merge-iterator-vs-dict-merge.md

compaction-crash-can-resurrect-deleted-keys [IN] OBSERVATION

A crash during LSM compaction after tombstones are stripped from the merged output but before old SSTables are deleted can resurrect previously-deleted keys, because the tombstone that suppressed them no longer exists in any surviving SSTable.

Source: entries/2026/05/29/topic-crash-recovery-testing.md

compaction-deletes-before-reader-release [IN] OBSERVATION

compact() deletes old SSTable files immediately after replacing self._sstables, with no mechanism to defer deletion until active readers holding references to the old list finish their iterations

Source: entries/2026/05/29/topic-superversion-refcount-implementation.md

compaction-hazard-within-broken-durability-pipeline [IN] OBSERVATION

Compaction is the highest-risk operation yet operates within a durability pipeline broken at both ends: crash-unsafe compaction can permanently lose data or resurrect deletes, AND the surrounding infrastructure has fsync gaps (data may never reach disk) and unrecoverable integrity checks (corruption detected but not repaired) — the most dangerous operation has the least protection.

compaction-is-critical-data-lifecycle-hazard [IN] OBSERVATION

Compaction is the critical junction where crash safety and data lifecycle intersect: crash-unsafe compaction can permanently lose data or resurrect deleted keys (no write-temp/fsync/rename, delete-before-rename ordering), and fragmented tombstone semantics mean those corruptions propagate inconsistently through derived systems that require flush and old-value tracking for correctness.

compaction-is-explicit-not-background [IN] OBSERVATION

Both the LSM and SSTable modules trigger compaction via explicit synchronous method calls (compact() / run_compaction()), not background threads — removing the write-amplification pressure that motivates least-overlap selection in production systems.

Source: entries/2026/05/29/topic-leveled-compaction-write-amplification.md

compaction-is-unsalvageable-as-designed [IN] OBSERVATION

Compaction is unsalvageable as designed: it is the highest-risk operation within a durability pipeline broken at both ends (crash-unsafe writes that are unverifiable by testing), AND it triggers two independent failure modes under the concurrent access that production workloads produce (concurrent readers see inconsistent state, concurrent writers corrupt shared data structures), making it simultaneously the most critical and most dangerous operation.

compaction-lacks-crash-safety-across-implementations [IN] OBSERVATION

No storage engine implementation uses the write-temp/fsync/rename pattern for file creation, and both Bitcask implementations delete old segments before renaming replacements, creating crash windows where data exists in neither old nor new files.

compaction-manager-never-deletes-old-files [IN] OBSERVATION

CompactionManager removes compacted SSTables from the in-memory _sstables list but never deletes their underlying files on disk; cleanup is the caller's responsibility.

Source: entries/2026/05/29/sstable-and-compaction-sstable-CompactionManager.md

compaction-manager-never-removes-tombstones [IN] OBSERVATION

The CompactionManager in sstable-and-compaction/sstable.py always calls mergesstables with the default removetombstones=False, making it conservatively correct but causing unbounded tombstone accumulation

Source: entries/2026/05/28/topic-tombstone-gc-safety.md

compaction-manager-supports-two-strategies [IN] OBSERVATION

CompactionManager implements both 'size_tiered' and 'leveled' compaction strategies, selected by a string parameter.

Source: entries/2026/05/28/sstable-and-compaction-test_sstable.md

compaction-not-atomic [IN] OBSERVATION

LSMTree.compact() performs a multi-step file swap (write new SSTable, update in-memory list, delete old files) with no mechanism to make the transition atomic across crashes

Source: entries/2026/05/28/topic-manifest-based-compaction.md

compaction-preserves-original-offsets [IN] OBSERVATION

After compact_partition, surviving messages retain their original offset values, creating non-contiguous gaps in the offset sequence; offsets are never renumbered.

Source: entries/2026/05/29/topic-log-compaction-vs-retention.md

compaction-propagates-corruption [IN] OBSERVATION

hash-index-storage/bitcask.py:compact copies records without integrity validation, so silently corrupted data survives compaction into new files

Source: entries/2026/05/29/topic-hash-index-bitcask-no-crc.md

compaction-respects-size-limit [IN] OBSERVATION

Bitcask compaction output is itself subject to maxfilesize rotation, so smaller size limits produce more post-compaction files — feeding back into recovery cost.

Source: entries/2026/05/29/topic-wal-segment-sizing-tradeoffs.md

compaction-result-assigned-level-1 [IN] OBSERVATION

CompactionManager.run_compaction with leveled strategy assigns merged output to level 1 but does not verify that the new file's key range is disjoint from existing L1 files

Source: entries/2026/05/29/topic-compaction-output-splitting.md

compaction-strategy-vs-scheduling-decoupled [IN] OBSERVATION

The sstable.py CompactionManager separates *which* SSTables to merge (size-tiered vs. leveled strategy) from *when* to merge, but neither the manager nor any caller implements scheduling, rate-limiting, or load-aware deferral logic.

Source: entries/2026/05/29/topic-bitcask-merge-window-scheduling.md

compaction-threshold-controls-overlap-window [IN] OBSERVATION

The compaction_threshold parameter (lsm.py:204) directly controls how many overlapping SSTables can accumulate before compaction, setting the worst-case missing-key probe count to threshold - 1

Source: entries/2026/05/29/topic-size-tiered-vs-leveled-read-amplification.md

compaction-triggered-by-sstable-count [IN] OBSERVATION

LSM compaction runs automatically when len(self.sstables) >= self.compactionthreshold (default 4), triggered at the end of flush after the new SSTable is registered

Source: entries/2026/05/29/log-structured-merge-tree-lsm-_flush.md

compare-returns-four-states [IN] OBSERVATION

VectorClock.compare returns exactly one of BEFORE, AFTER, EQUAL, or CONCURRENT, implementing the partial order defined by component-wise comparison.

Source: entries/2026/05/29/vector-clocks-vector_clock.md

concurrency-check-is-pre-mutation [IN] OBSERVATION

The expectedversion optimistic concurrency check in appendbatch runs before any state mutation, so a ConcurrencyConflict exception is safe and leaves the store completely unchanged.

Source: entries/2026/05/29/event-sourcing-store-eventstore-appendbatch.md

concurrency-unsafe-on-both-read-and-write-paths [IN] OBSERVATION

Concurrency safety is absent from both mutation and query paths: core components (B-tree, garbage collector, consistent hash ring) silently assume single-threaded write access with no locks or assertions, and range scans lack snapshot isolation, cycle guards, or concurrent modification protection, meaning concurrent workloads can corrupt state through both writes and reads independently.

concurrent-access-during-compaction-is-doubly-unsafe [IN] OBSERVATION

Compaction under concurrent access is unsafe in two independent failure modes: concurrent readers and writers have no synchronization (no locks, latches, or snapshots on either path), AND compaction itself has no crash-safe file operations — concurrent access can corrupt live state silently, and a crash during this unsynchronized compaction produces irrecoverable data loss.

conflict-requires-different-origins [IN] OBSERVATION

A conflict in applyremotechange is only detected when localorigin != remotenode; two writes from the same origin at different timestamps are treated as sequential updates and resolved by tuple comparison without generating a ConflictRecord.

Source: entries/2026/05/29/multi-leader-replication-multileader-applyremote_change.md

conflict-resolution-architecture-is-split [IN] OBSERVATION

Conflict resolution is implemented in two disconnected modules: CRDTs encode resolution in their merge semantics (self-resolving), while the strategy enum covers only LWW and custom-merge — leaving no unified API and no bridge between the two approaches.

consensus-and-membership-use-incompatible-convergence-models [IN] OBSERVATION

Gossip-based failure detection and Raft consensus interact in a way that can compound partition hazards: gossip's timeout-driven liveness set determines cluster membership, while Raft's partition behavior means an isolated leader silently accepts uncommittable writes and its inflated term forces re-election upon rejoining — if gossip's failure detection misclassifies a partitioned-but-live leader, it may trigger membership changes that interact with Raft's already-disruptive partition recovery.

consensus-architectures-are-inverse-optimizations [IN] OBSERVATION

Raft and Total Order Broadcast represent contrasting approaches to Paxos optimization: Raft centralizes proposal authority in a single elected leader (one election per term, then Phase-2-only replication), eliminating dueling proposals within a term, while TOB allows any node to propose for any slot using full two-phase Paxos each time, making it susceptible to competing proposals. These represent different points in the leader-based versus leaderless tradeoff space.

consensus-correctness-doubly-unverified [IN] OBSERVATION

Consensus protocol correctness is doubly unverified: both Raft and TOB represent untested inverse optimizations of Multi-Paxos whose safety properties diverge specifically under asynchrony and crash failures, AND the testing methodology covers neither crash nor asynchronous failure modes — the exact conditions under which the two optimization strategies would reveal different safety profiles.

consensus-safety-unverified-across-optimization-variants [IN] OBSERVATION

Both consensus mechanisms represent inverse optimizations of Multi-Paxos (Raft centralizes proposal authority in a leader, TOB decentralizes it to allow any node to propose), but neither variant's safety has been validated under the asynchronous conditions it is designed for — all protocol tests use deterministic synchronous delivery with no real network I/O, message loss, or reordering.

consistent-hash-add-node-is-idempotent [IN] OBSERVATION

Adding a node that already exists silently returns an empty transfer map and does not change the ring's node count, vnode positions, or key assignments.

Source: entries/2026/05/29/consistent-hashing-testconsistenthashing.md

consistent-hash-default-150-vnodes [IN] OBSERVATION

ConsistentHashRing defaults to 150 virtual nodes per physical node, and exposes loadimbalance (maxload / avg_load) to measure distribution quality

Source: entries/2026/05/29/topic-ddia-chapter-6-partitioning.md

consistent-hash-get-nodes-first-equals-get-node [IN] OBSERVATION

getnodes(key)[0] always equals getnode(key) — the preference list's head is the primary replica.

Source: entries/2026/05/29/consistent-hashing-testconsistenthashing.md

consistent-hash-minimal-redistribution [IN] OBSERVATION

Adding an Nth node to the ring moves approximately 1/N of keys, not a full reshuffle — the defining property that makes consistent hashing useful for dynamic cluster membership.

Source: entries/2026/05/29/consistent-hashing-testconsistenthashing.md

consistent-hash-replication-requires-sufficient-physical-nodes [IN] OBSERVATION

getnodes() raises ValueError rather than silently under-replicating when replicationfactor exceeds the number of physical nodes on the ring.

Source: entries/2026/05/29/consistent-hashing-testconsistenthashing.md

consistent-hash-ring-lookup-is-olog-v [IN] OBSERVATION

Key lookup via get_node is O(log V) where V is total virtual nodes, using bisect on the sorted position list; node addition is O(V) per vnode due to list.insert.

Source: entries/2026/05/29/consistent-hashing-consistent_hashing-ConsistentHashRing.md

consistent-hash-ring-no-positive-validation [IN] OBSERVATION

ConsistentHashRing does not validate that numvnodes, weight, or replicationfactor are positive; zero or negative values silently produce degenerate ring states (e.g., a node with zero vnodes exists in _nodes but owns no ring arc).

Source: entries/2026/05/29/consistent-hashing-consistent_hashing-ConsistentHashRing.md

consistent-hash-ring-not-thread-safe [IN] OBSERVATION

ConsistentHashRing has no synchronization; concurrent addnode/removenode calls corrupt the sorted ringpositions and ringnodes lists.

Source: entries/2026/05/29/consistent-hashing-consistent_hashing-ConsistentHashRing.md

consistent-hash-ring-sorted-invariant [IN] OBSERVATION

ringpositions is maintained in sorted order at all times via bisect insertion; ringnodes is kept parallel to it with len(ringpositions) == len(ringnodes) always holding.

Source: entries/2026/05/29/consistent-hashing-consistent_hashing-ConsistentHashRing.md

consistent-hash-ring-uses-32-bit-space [IN] OBSERVATION

The ring's hash space is [0, 2^32), as validated by testringpositionvalidrange.

Source: entries/2026/05/29/consistent-hashing-testconsistenthashing.md

consistent-hash-ring-uses-md5 [IN] OBSERVATION

All consistent hash ring positions are computed via MD5 truncated to 32 bits; the hash function is not pluggable

Source: entries/2026/05/29/topic-ddia-chapter-6-partitioning.md

consistent-hash-ring-uses-md5-truncated-to-32-bit [IN] OBSERVATION

The ring hashes keys and vnode identifiers using hashlib.md5 truncated to 32 bits via & 0xFFFFFFFF, chosen for distribution quality rather than security.

Source: entries/2026/05/29/consistent-hashing-consistent_hashing-ConsistentHashRing.md

consistent-hash-weight-scales-vnodes [IN] OBSERVATION

The weight parameter on add_node scales virtual node count proportionally, supporting heterogeneous nodes with different capacities

Source: entries/2026/05/29/topic-ddia-chapter-6-partitioning.md

consumer-group-offset-resume [IN] OBSERVATION

A new consumer joining a group loads committed offsets from the broker (tested at testpartitionedlog.py:152), enabling offset-based resume but also duplicate redelivery on crash

Source: entries/2026/05/29/topic-kafka-consumer-offset-semantics.md

consumer-independent-positions [IN] OBSERVATION

Each DerivedSystem tracks its own LSN position independently via _position, allowing consumers to fall behind or catch up at different rates without blocking each other

Source: entries/2026/05/29/unbundled-database-unbundled_database.md

context-parameter-establishes-causality [IN] OBSERVATION

Passing a vector clock as context to VersionedKVStore.put declares the write descends from that version; omitting context treats the write as concurrent with all existing versions

Source: entries/2026/05/29/vector-clocks-testvectorclock.md

convergence-rate-asymmetry-membership-vs-data [IN] OBSERVATION

Membership and data convergence operate at fundamentally different rates: gossip-based membership changes propagate in O(log N) rounds via epidemic-style random peer selection, but data convergence in ring topology requires O(N) sync rounds because each round advances changes by exactly one hop via store-and-forward requeuing — creating a window proportional to cluster size where the membership view is current but data remains stale.

correctness-gap-widens-under-failure [IN] OBSERVATION

The gap between expected and actual system correctness widens under failure: distributed protocols require storage-layer guarantees (crash-safe compaction, atomic writes, complete CRC coverage) that no implementation provides, and the resulting divergence accumulates without bound because anti-entropy can detect but not fully reconcile the inconsistencies.

correctness-unachievable-and-unfalsifiable [IN] OBSERVATION

System correctness is both unachievable and unfalsifiable: the gap between specification and implementation widens under every failure mode (distributed protocols require unmet storage guarantees, replica divergence accumulates without bound, storage degrades monotonically), and the testing methodology cannot detect these gaps (crash paths untested, protocols validated only under synchronous simulation) — the system cannot be correct and cannot discover that it is not.

corruption-is-terminal-across-all-readers [IN] OBSERVATION

Corruption is a terminal condition across every binary record reader in the codebase: each stops at the first CRC failure with no resync capability, and the WAL specifically halts all replay rather than skipping the corrupt record, meaning any single corrupt byte truncates the recoverable history.

corruption-propagates-through-all-data-pipelines [IN] OBSERVATION

Corruption propagates silently through every data transformation pipeline: both Bitcask compaction implementations copy records without CRC validation, and hint file generation reads source records without integrity checks — every transformation step amplifies corruption rather than filtering it.

corruption-terminates-scan [IN] OBSERVATION

In scansegment, a CRC mismatch or short read stops scanning the entire segment; records after the corrupt point are silently lost from the index even if they are individually valid.

Source: entries/2026/05/28/log-structured-hash-table-bitcask-scansegment.md

count-is-barrier [IN] OBSERVATION

Count stage materializes its entire input into a defaultdict before yielding any output, making it a pipeline barrier that forces all upstream stages to complete first

Source: entries/2026/05/29/batch-word-count-pipeline.md

counting-bloom-4bit-safe-at-capacity [IN] OBSERVATION

At designed capacity (n ≤ expected_items), the expected number of saturated 4-bit counters is effectively zero because the per-position load follows Poisson(ln 2 ≈ 0.693), making P(count ≥ 15) ≈ 10⁻¹⁵

Source: entries/2026/05/29/topic-counter-saturation-probability.md

counting-bloom-filter-reference-counted [IN] OBSERVATION

CountingBloomFilter uses reference-counting semantics: an item added N times requires N remove() calls before it tests as absent via _contains_

Source: entries/2026/05/29/bloom-filter-testbloomfilter.md

counting-bloom-overload-threshold [IN] OBSERVATION

At 10× overload (10× expected_items inserted), roughly 0.5% of CountingBloomFilter counters saturate; at 20× overload, approximately half saturate, effectively degrading the filter to a non-counting Bloom filter

Source: entries/2026/05/29/topic-counter-saturation-probability.md

counting-bloom-remove-needs-drop-tracking [IN] OBSERVATION

Using CountingBloomFilter.remove() during compaction would require merge_sstables to report which keys were discarded, which the current implementation does not do — duplicates and tombstones are silently skipped in the merge loop

Source: entries/2026/05/28/topic-counting-bloom-compaction-interaction.md

counting-bloom-supports-deletion [IN] OBSERVATION

CountingBloomFilter uses saturating 4-bit counters so entries can be removed via remove(), which standard BloomFilter cannot do — relevant for maintaining filters over mutable data structures

Source: entries/2026/05/28/topic-bloom-filter-integration.md

counting-bloom-trades-correctness-for-unneeded-capability [IN] OBSERVATION

The counting Bloom filter pays 8x memory overhead and introduces false negatives through its removal operation for a capability (element deletion) that the primary use case — immutable SSTables — never needs, since SSTables are write-once and discarded whole during compaction.

counting-bloom-useful-for-memtable [IN] OBSERVATION

CountingBloomFilter.remove() is architecturally suited for maintaining filters over mutable in-memory structures (memtables) where keys can be overwritten without rescanning, not for SSTable compaction where the entire file is rewritten

Source: entries/2026/05/28/topic-counting-bloom-compaction-interaction.md

counting-bloom-uses-byte-per-counter [IN] OBSERVATION

CountingBloomFilter allocates one full byte per counter position (bytearray(self.m) at line 112) despite counterbits defaulting to 4, using 8x the space of a standard BloomFilter's bit array rather than the expected 4x.

Source: entries/2026/05/29/topic-cuckoo-filters.md

crash-failure-paths-systematically-untested [IN] OBSERVATION

Crash and failure recovery paths are systematically excluded from the test suite: the WAL has no tests for truncated records or CRC mismatches, LSM crash testing covers only WAL replay and ignores compaction crashes entirely, and SSI write-skew tests exist only in standalone tester files outside the default pytest runner — the most critical correctness scenarios have the least test coverage.

crash-recovery-both-broken-and-unverified [IN] OBSERVATION

Crash recovery is simultaneously broken and unverified: no storage engine has a safe crash recovery path (non-atomic compaction, batch-blind replay, metadata-excluding CRC), and crash/failure paths are systematically excluded from the test suite — recovery bugs will persist indefinitely because neither the broken mechanisms nor their absence is tested.

crc-detects-not-prevents-data-loss [IN] OBSERVATION

CRC32 checksums in the WAL and Bitcask detect partial writes after a crash but cannot recover the lost data — they convert silent corruption into detected data loss, not into a recoverable state

Source: entries/2026/05/29/topic-crash-safety-of-truncate.md

crc-mask-is-python2-artifact [IN] OBSERVATION

The & 0xFFFFFFFF mask applied after zlib.crc32 in all modules is a Python 2 compatibility artifact; Python 3's zlib.crc32 already returns an unsigned 32-bit integer, making the mask harmless but unnecessary.

Source: entries/2026/05/29/topic-crc32-vs-crc32c.md

crc-no-cross-file-atomicity [IN] OBSERVATION

The CRC32 integrity checks in log-structured-hash-table/bitcask.py detect corrupt records within a segment but provide no protection against the cross-file atomicity problem during compaction.

Source: entries/2026/05/28/topic-bitcask-compaction-atomicity.md

crc-polynomial-change-breaks-wire-format [IN] OBSERVATION

Switching from ISO 3309 (zlib.crc32) to Castagnoli (CRC-32C) is a backward-incompatible wire format change; existing data files would fail CRC verification without a format version mechanism in the file header.

Source: entries/2026/05/29/topic-crc32-vs-crc32c.md

crc32-covers-records-not-files [IN] OBSERVATION

CRC32 is computed per individual record or page (typically hundreds of bytes to a few KB), not per file; file-level corruption is detected only indirectly by failing to decode a record, not by any file-wide checksum

Source: entries/2026/05/29/topic-crc32-vs-xxhash-choice.md

crdt-eq-compares-semantic-state [IN] OBSERVATION

All four CRDT types implement _eq to compare semantic state (not object identity), enabling CRDTReplicaGroup.allconverged() and the semilattice property tests to verify convergence via equality checks.

Source: entries/2026/05/29/conflict-free-replicated-data-types-test_crdts.md

crdt-merge-is-idempotent [IN] OBSERVATION

All four CRDT merge methods (GCounter, PNCounter, LWWRegister, ORSet) are idempotent: merging the same state twice produces the same result as merging once

Source: entries/2026/05/29/conflict-free-replicated-data-types-crdts.md

crdt-merge-is-idempotent-and-convergence-tested [IN] OBSERVATION

All four CRDT types demonstrate idempotent merge (re-merging produces no change), semantic equality comparison, and monotonic ORSet tombstones. The sync_all test confirms convergence after two rounds, consistent with merge being commutative and associative, though these properties are exercised by tests rather than proven by the antecedents alone.

crdt-merge-returns-self [IN] OBSERVATION

merge() mutates and returns self (enabling chaining like deepcopy(a).merge(b)) rather than returning a new instance, which means callers must deepcopy before merging if they need to preserve the original state.

Source: entries/2026/05/29/conflict-free-replicated-data-types-test_crdts.md

crdt-mutation-api-is-non-defensive [IN] OBSERVATION

CRDT mutation APIs lack defensive programming patterns at the boundaries: merge mutates the receiver in place and returns self (requiring callers to defensively copy before merging to prevent aliasing), and remove operations on absent elements silently return False rather than raising errors, making accidental no-ops invisible.

crdts-are-self-resolving [IN] OBSERVATION

Each CRDT type in crdts.py (GCounter, PNCounter, LWWRegister, ORSet) encodes conflict resolution in its own merge() method, requiring no external strategy enum or callback

Source: entries/2026/05/29/topic-conflict-resolution-strategies.md

crdts-are-state-based-cvrdts [IN] OBSERVATION

All four CRDT types (GCounter, PNCounter, LWWRegister, ORSet) use state-based replication via a merge() method that implements a join-semilattice; no operation log or causal delivery is used

Source: entries/2026/05/29/topic-delta-state-crdts.md

critical-wal-operations-always-force-fsync [IN] OBSERVATION

Checkpoints, batch commits, and segment rotations all bypass the configured sync mode by calling fsync unconditionally, creating a two-tier durability model where structural WAL operations are always durable even when individual record writes are not.

data-distribution-and-processing-hit-independent-scalability-ceilings [IN] OBSERVATION

Data distribution and processing both hit fundamental scalability ceilings: neither partitioning strategy provides both reliable routing and ordered access (hash destroys order, range depends on unverified parallel-array invariants), while query and aggregation operations hide unbounded memory barriers behind uniform tuple interfaces, meaning neither the data layout nor the query path scales gracefully under load.

data-scan-skips-value-bytes [IN] OBSERVATION

scandatafile in the hash-index Bitcask reads each record's header and key but seeks past value bytes; it is still O(datasize) because headers must be parsed sequentially to locate record boundaries

Source: entries/2026/05/29/topic-bitcask-startup-cost.md

ddia-config-via-constructor-params [IN] OBSERVATION

All 37 modules configure behavior entirely through constructor parameters (e.g., maxfilesize, syncmode, pagesize) — there are no config files, environment variables, or settings modules.

Source: entries/2026/05/29/repo-overview.md

ddia-modules-self-contained [IN] OBSERVATION

Each top-level directory is a self-contained module with no cross-module imports; implementations are intentionally standalone so each concept can be understood in isolation.

Source: entries/2026/05/28/scan-ddia-implementations.md

ddia-pure-python-stdlib [IN] OBSERVATION

All implementations are Python; the storage engine modules use only stdlib except for sortedcontainers in the LSM tree — no frameworks or production infrastructure.

Source: entries/2026/05/28/scan-ddia-implementations.md

ddia-tester-files-stdout-validation [IN] OBSERVATION

The testertest*.py files are designed as standalone scripts that print "testname PASSED" / "testname FAILED" to stdout and include if _name == "main_" blocks, enabling automated verification without pytest.

Source: entries/2026/05/29/repo-overview.md

ddia-three-file-pattern [IN] OBSERVATION

Most modules follow a consistent three-file pattern: one implementation file, one test file, and a testertest*.py file (likely a meta-test or test harness validator).

Source: entries/2026/05/28/scan-ddia-implementations.md

decode-with-id-no-format-validation [IN] OBSERVATION

decodewithid does not validate the message format before parsing; any 4+ byte input will be interpreted as schema-ID-prefixed data, producing a confusing KeyError on invalid input rather than a clear format error

Source: entries/2026/05/29/topic-confluent-schema-registry-protocol.md

deepcopy-prevents-snapshot-corruption [IN] OBSERVATION

savesnapshot() uses copy.deepcopy on the projection state dict to create an independent copy, ensuring subsequent event processing during catchup() does not mutate the saved snapshot

Source: entries/2026/05/29/topic-event-sourcing-snapshots.md

degradation-is-irreversible [IN] OBSERVATION

System degradation is irreversible: the system degrades monotonically at every abstraction level with no self-healing (leaked pages, growing tree height, accumulating divergence), and no subsystem at any architectural tier has a viable recovery strategy to arrest the erosion — recovery infrastructure is either vestigial, paradoxically over-engineered, or fundamentally missing.

delete-before-rename-ordering [IN] OBSERVATION

Both Bitcask implementations delete old segment files (os.remove) before renaming the compacted replacement, creating a crash window where neither old nor properly-named new data is on disk

Source: entries/2026/05/28/topic-bitcask-crash-recovery.md

delete-semantics-fragmented-from-storage-to-derived-systems [IN] OBSERVATION

Delete propagation is fragmented end-to-end: tombstone representations differ at every storage layer (ambiguous empty-bytes sentinel, premature compaction purging, replication-dependent lifetime), and derived systems require explicit flush plus old-value CDC events that inconsistent tombstones cannot reliably provide.

deletion-visibility-is-symmetric [IN] OBSERVATION

The same three-condition visibility rule (committed, not in activeatstart, lower txid) is applied independently to both the creating and deleting transaction of a version — is_visible runs the same logic in two phases.

Source: entries/2026/05/29/snapshot-isolation-mvccdatabase-is_visible.md

derived-system-consistency-requires-flush-and-old-values [IN] OBSERVATION

Derived systems (secondary indexes, materialized views) require two independent conditions for consistency: explicit flush to make mutations visible, and CDC old-value capture to remove stale index entries — either condition failing silently produces stale reads.

derived-systems-are-position-tracked [IN] OBSERVATION

Every DerivedSystem in the unbundled database tracks a position cursor indicating how far through the CDC log it has consumed, enabling independent catch-up and ensuring consumers can resume from where they left off

Source: entries/2026/05/29/topic-ddia-ch12-unbundling.md

derived-systems-depend-on-unreliable-event-infrastructure [IN] OBSERVATION

The derived-system pattern (secondary indexes, materialized views, projections) depends on event infrastructure that is unreliable at its foundation: event sourcing has no durable recovery mechanism and conflates two ID spaces, while derived systems require explicit flush calls and old-value CDC events for consistency — the consumer-side correctness requirements depend on producer-side guarantees that do not hold.

derived-systems-independently-position-tracked [IN] OBSERVATION

Each DerivedSystem tracks its own LSN position independently, allowing consumers to fall behind or catch up at different rates without coordinator state or blocking

Source: entries/2026/05/29/topic-ddia-unbundling-concept.md

derived-systems-rebuildable-from-cdc [IN] OBSERVATION

Every DerivedSystem implements rebuild(events) that clears state and replays from scratch, guaranteeing eventual convergence with the CDC event log regardless of prior state

Source: entries/2026/05/29/unbundled-database-unbundled_database.md

deterministic-serialization-enables-independent-verification [IN] OBSERVATION

json.dumps(sort_keys=True) ensures all honest PBFT nodes produce identical SHA-256 digests for identical requests without inter-node coordination, transforming trust verification into a deterministic computation

Source: entries/2026/05/29/topic-byzantine-fault-tolerance.md

digest-binds-view-sequence-to-request [IN] OBSERVATION

The accepted_preprepare dict enforces a one-to-one mapping from (view, sequence) to digest, preventing a Byzantine primary from assigning two different requests to the same protocol slot

Source: entries/2026/05/29/topic-byzantine-fault-tolerance.md

digest-recomputed-on-preprepare [IN] OBSERVATION

Backup replicas independently recompute the SHA-256 digest from the request payload in every PRE_PREPARE message; they never trust the primary's claimed digest value

Source: entries/2026/05/29/topic-byzantine-fault-tolerance.md

directory-scan-recovery-unsafe-during-compaction [IN] OBSERVATION

loadexisting_sstables using os.listdir cannot distinguish pre-compaction from post-compaction file sets after a crash — if the crash occurs after creating the merged SSTable but before deleting the inputs, recovery sees duplicates; the inverse loses data.

Source: entries/2026/05/29/topic-manifest-and-crash-recovery.md

distributed-cluster-lacks-reliable-infrastructure [IN] OBSERVATION

The distributed cluster has no reliable foundational infrastructure: membership detection is unreliable in both accuracy (no adaptive thresholds) and propagation (asymmetric convergence rates between membership and data), and ordering infrastructure is broken at every layer (vestigial WAL sequence numbers, conflated event sourcing ID spaces, volatile CDC consumer positions), meaning neither who is in the cluster nor what order things happened can be answered reliably.

distributed-correctness-doubly-unachievable-under-partition [IN] OBSERVATION

Distributed correctness is doubly unachievable under network partitions: protocols require storage-layer guarantees (crash-safe compaction, CRC-protected metadata) that aren't met, AND partitions amplify the resulting gaps through disrupted gossip-based failure detection and stale leader writes — the prerequisites for correctness are absent even before partitions introduce additional failure modes.

distributed-correctness-undermined-at-both-layers [IN] OBSERVATION

Distributed system correctness is undermined at both the storage and protocol layers: storage engines silently assume single-threaded access (no locks, no assertions, no documentation), while the quorum protocol weakens its own semantic guarantees (counting hint storage as successful writes, allowing sub-quorum configurations without error).

distributed-divergence-accumulates-without-bound [IN] OBSERVATION

Replica divergence accumulates without bound: write operations have compounding correctness gaps (sloppy quorums count hints, sub-quorum configs accepted, conflict resolution split across modules), and the repair mechanism (Merkle-based anti-entropy) cannot fully reconcile because tombstone semantics differ at every layer.

distributed-protocols-rest-on-unverifiable-assumptions [IN] OBSERVATION

Distributed protocols rest on doubly invalid foundations: end-to-end correctness requires storage-layer guarantees (crash-safe compaction, CRC-protected metadata) that no storage engine provides, and protocol safety claims are unfalsifiable under the current testing methodology (synchronous simulation, no crash path tests) — the protocols assume both correct storage and correct testing, and have neither.

distributed-protocols-simulate-synchronous-delivery [IN] OBSERVATION

All distributed protocol implementations use synchronous message delivery: PBFT runs the full three-phase protocol in a single deterministic call, bully elections resolve cascading responses within one tick, and Lamport clocks deliver messages in the same call stack — none model the network asynchrony that is the core difficulty of distributed systems.

distributed-tombstone-gc-requires-downtime-bound [IN] OBSERVATION

Any tombstone garbage collection strategy must define a maximum tolerated node downtime; tombstones removed before a down node receives them cause data resurrection via read repair or anti-entropy.

Source: entries/2026/05/29/topic-tombstone-gc-and-repair-window.md

distributed-tombstone-removal-needs-replication-convergence [IN] OBSERVATION

In the multi-leader replication module, tombstones cannot be safely removed until all replicas have received the delete, adding a replication-convergence constraint beyond the local compaction-coverage constraint

Source: entries/2026/05/28/topic-tombstone-gc-safety.md

distributed-writes-have-compounding-correctness-gaps [IN] OBSERVATION

Distributed write correctness has compounding weaknesses: quorum semantics are weakened by sloppy quorum and permissive configuration, and the conflict resolution that should catch remaining inconsistencies is split between two disconnected mechanisms (self-resolving CRDTs and an incomplete strategy enum).

do-sync-force-bypasses-sync-mode [IN] OBSERVATION

dosync(force=True) always calls fsync regardless of the configured sync_mode, ensuring durability-critical operations like segment rotation and WAL close are never silently skipped

Source: entries/2026/05/29/topic-sync-mode-none-safety.md

doc-partitioned-query-touches-all [IN] OBSERVATION

DocumentPartitionedDB.querybyfield always iterates every partition regardless of result count, touching exactly num_partitions partitions (scatter/gather with no short-circuit).

Source: entries/2026/05/29/secondary-index-partitioning-secondaryindexpartitioning.md

doc-partitioned-write-touches-one [IN] OBSERVATION

DocumentPartitionedDB.put() always touches exactly 1 partition regardless of how many fields are indexed, because the secondary index is co-located with the document on its home partition.

Source: entries/2026/05/29/secondary-index-partitioning-testsecondaryindex_partitioning.md

durability-bugs-invisible-to-testing [IN] OBSERVATION

Durability bugs are permanently invisible: the write-to-verify durability pipeline is broken at both ends (fsync policy gaps prevent data from reaching stable storage, and incomplete integrity checks cannot verify it arrived), while crash/failure recovery paths are systematically excluded from testing — the system cannot detect its own durability failures through any available mechanism.

durability-pipeline-broken-at-both-ends [IN] OBSERVATION

The write-to-verify durability pipeline is broken at both ends: fsync policy inconsistencies across critical paths mean data may never durably reach disk (B-tree skips fsync for structural metadata while paying double for data), and incomplete CRC coverage means corrupted data that does reach disk passes integrity checks undetected (payload-only CRC leaves routing metadata unprotected).

dynamo-cluster-hints-after-quorum [IN] OBSERVATION

DynamoCluster.put only stores hints after the write quorum is already met on real nodes, unlike HintedHandoffStore which counts hints toward the quorum — making DynamoCluster a strict-quorum-with-opportunistic-hints design, not a true sloppy quorum

Source: entries/2026/05/29/topic-sloppy-quorum-tradeoffs.md

dynamo-conflict-detection-same-version-different-values [IN] OBSERVATION

When multiple replicas hold different values at the same max version number, ReadResult.is_conflict is True and value is a list of all conflicting values rather than a single resolved value

Source: entries/2026/05/29/leaderless-replication-testdynamotester.md

dynamo-failed-write-rolls-back-version [IN] OBSERVATION

When put() fails to meet write quorum, it decrements the version counter to prevent version gaps, then raises QuorumNotMet

Source: entries/2026/05/29/leaderless-replication-dynamo.md

dynamo-hints-single-homed [IN] OBSERVATION

Hinted handoffs for all unavailable nodes are stored on available_nodes[0] only, creating a single point of responsibility for hint delivery and a potential hotspot

Source: entries/2026/05/29/leaderless-replication-dynamo.md

dynamo-no-delete-support [IN] OBSERVATION

The leaderless replication implementation (dynamo.py) has no delete or tombstone mechanism; adding deletes without tombstones would cause resurrection via read repair or anti-entropy

Source: entries/2026/05/29/topic-sstable-tombstone-gc-safety.md

dynamo-per-key-versioning [IN] OBSERVATION

Version counters in DynamoCluster.versioncounters are scoped per key, not global across the cluster; independent keys maintain independent version sequences

Source: entries/2026/05/29/leaderless-replication-testdynamotester.md

dynamo-read-repair-is-all-node [IN] OBSERVATION

DynamoCluster.get() repairs all available nodes after every read, not just the quorum participants, trading higher per-read write fan-out for faster convergence compared to ReadRepairStore.

Source: entries/2026/05/29/topic-anti-entropy-vs-read-repair.md

dynamo-read-repair-is-eager [IN] OBSERVATION

DynamoCluster.get() repairs all stale replicas reachable during the read, not just enough to satisfy the quorum — every replica with a version below the max is updated

Source: entries/2026/05/29/leaderless-replication-dynamo.md

dynamo-sloppy-quorum-opt-in [IN] OBSERVATION

Hinted handoff only occurs when DynamoCluster is constructed with sloppyquorum=True; without it, deliverhints() returns 0 and offline nodes receive no data until anti-entropy runs

Source: entries/2026/05/29/leaderless-replication-testdynamotester.md

dynamo-version-counter-is-global [IN] OBSERVATION

Version assignment uses a single coordinator-level counter per key (versioncounters), not per-replica counters, which prevents version conflicts but requires a centralized coordinator

Source: entries/2026/05/29/leaderless-replication-dynamo.md

dynamo-write-fans-to-all [IN] OBSERVATION

DynamoCluster.put() sends writes to every available node, not just W of them; the quorum check gates on acknowledgment count, not send count, maximizing replica consistency

Source: entries/2026/05/29/leaderless-replication-dynamo.md

empty-predicate-match-still-locks [IN] OBSERVATION

A predicate that matches zero keys still appends a predicate lock to tx.predicatelocks, enabling phantom detection if a concurrent transaction later inserts a key that would match

Source: entries/2026/05/29/write-skew-detection-ssidatabase-readpredicate.md

encode-with-id-no-magic-byte [IN] OBSERVATION

SchemaRegistry.encodewithid uses a raw 4-byte big-endian schema ID prefix with no magic byte, unlike Confluent's 5-byte header (0x00 + 4-byte ID), so there is no format versioning for future wire format changes

Source: entries/2026/05/29/topic-confluent-schema-registry-protocol.md

end-to-end-correctness-requires-unmet-storage-guarantees [IN] OBSERVATION

End-to-end distributed correctness is unachievable: protocol-layer weaknesses (sloppy quorums, single-threaded assumptions) depend on storage-layer guarantees (atomic recovery, batch integrity, metadata checksums) that no implementation provides.

entry-count-header-never-verified [IN] OBSERVATION

The entry count stored in the SSTable header by SSTableWriter.finish() is trusted by SSTableReader but never validated against the actual number of entries read from the data section

Source: entries/2026/05/29/topic-sstable-magic-number-vs-crc.md

equivocating-mode-expands-message-count [IN] OBSERVATION

In EQUIVOCATING mode, each broadcast message is replaced by N-1 targeted messages with per-peer distinct digests computed via compute_digest, meaning the output list can be larger than the input

Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-applybyzantine.md

es-event-ids-are-stream-scoped [IN] OBSERVATION

Event IDs are sequential integers per-stream starting at 1, not globally unique identifiers; global_position is the separate cross-stream counter.

Source: entries/2026/05/29/event-sourcing-store-testeventstore.md

es-global-position-tracks-all-streams [IN] OBSERVATION

global_position increments monotonically across all streams and equals the total number of events ever appended to the store.

Source: entries/2026/05/29/event-sourcing-store-testeventstore.md

es-live-projection-auto-updates [IN] OBSERVATION

LiveProjection automatically applies new events on each append() after the initial catchup() call — no subsequent explicit catchup() required.

Source: entries/2026/05/29/event-sourcing-store-testeventstore.md

es-optimistic-concurrency-via-expected-version [IN] OBSERVATION

append(..., expected_version=V) succeeds only if the current stream version equals V; a stale version raises ConcurrencyConflict.

Source: entries/2026/05/29/event-sourcing-store-testeventstore.md

es-persistence-uses-jsonl [IN] OBSERVATION

EventStore disk persistence uses JSONL (newline-delimited JSON) format, loaded in full on construction to rebuild the in-memory state.

Source: entries/2026/05/29/event-sourcing-store-testeventstore.md

es-projection-catch-up-is-incremental [IN] OBSERVATION

Projection.catch_up() only processes events after its current position, returning the count of newly processed events; a second call with no new events processes zero.

Source: entries/2026/05/29/event-sourcing-store-testeventstore.md

es-snapshot-saves-state-and-position [IN] OBSERVATION

Projection.savesnapshot() persists both the projection state and its current position; loadsnapshot() restores both so that catch_up() resumes from the snapshot point.

Source: entries/2026/05/29/event-sourcing-store-testeventstore.md

es-snapshots-are-in-memory-dicts [IN] OBSERVATION

Event sourcing snapshots on store.snapshots are plain Python dicts monkey-patched onto the store instance with no serialization; getattr(self.store, '_snapshots', {}) defensively handles their absence

Source: entries/2026/05/29/topic-snapshot-persistence.md

es-snapshots-incomplete-for-production-use [IN] OBSERVATION

Event sourcing snapshots are triply limited: stored as ephemeral in-memory dicts (lost on restart), excluding subscriber callbacks (requiring re-registration), and ignored by reconstruct_state (which always replays from the beginning).

es-temporal-query-via-up-to [IN] OBSERVATION

reconstructstate(handlers, events, upto=N) replays exactly the first N events to rebuild past state at that point in time.

Source: entries/2026/05/29/event-sourcing-store-testeventstore.md

es-verify-is-script-not-pytest [IN] OBSERVATION

test_verify.py runs as a standalone Python script with inline assertions and a final print statement, not through a test framework like pytest

Source: entries/2026/05/29/event-sourcing-store-test_verify.md

event-graph-acyclicity-by-construction [IN] OBSERVATION

reaches assumes the event graph is a DAG with no cycle detection; this is safe because Node methods only ever link to previously-created events via parent and _cause.

Source: entries/2026/05/29/lamport-clocks-lamport-_reaches.md

event-graph-correctness-via-dual-invariant [IN] OBSERVATION

The event causality graph maintains correctness through two complementary invariants: acyclicity by construction (preventing infinite traversal in reachability checks) and identity-based comparison (preventing conflation of structurally-equal but distinct events).

event-ids-are-1-based-sequential [IN] OBSERVATION

eventid is a global sequence number assigned as len(self.events) + 1, starting at 1 and incrementing by 1 for each appended event with no gaps within a process lifetime

Source: entries/2026/05/29/event-sourcing-store-event_store.md

event-infrastructure-unreliable-in-content-and-ordering [IN] OBSERVATION

Derived systems depend on event infrastructure that is independently unreliable in both content and ordering: events may be lost, duplicated, or mis-addressed due to unreliable persistence and addressing, AND their ordering is broken at every layer — WAL sequence numbers are vestigial, event IDs conflate two spaces, and CDC consumer positions are volatile.

event-sourcing-and-cdc-converge-on-projection-pattern [IN] OBSERVATION

Both Projection.catchup and DerivedSystem.processevent track a position cursor, pull/receive ordered events, and apply type-dispatched handlers — the same structural pattern solving "derived data" from different starting points

Source: entries/2026/05/29/topic-ddia-unbundling-concept.md

event-sourcing-conflates-two-id-spaces [IN] OBSERVATION

Event sourcing read operations conflate two ID spaces: projections assume contiguous stream-scoped IDs for catch-up arithmetic, while state reconstruction filters on global event_id, and sequence numbers survive compaction without renumbering — creating potential mismatches when events are removed or compacted.

event-sourcing-lacks-any-durable-recovery-path [IN] OBSERVATION

Event sourcing has no durable recovery mechanism: appends update in-memory state before persisting to disk with no rollback on failure, snapshots are ephemeral in-memory dicts lost on restart, and state reconstruction must always replay from the beginning with no incremental checkpoint support.

event-sourcing-unreliable-in-persistence-and-addressing [IN] OBSERVATION

Event sourcing is fundamentally unreliable at two independent levels: it has no durable recovery mechanism (in-memory state updated before disk, no crash-consistent append, ephemeral snapshots), and its core addressing abstraction conflates stream-scoped and global ID spaces across read paths, meaning even successfully persisted events may be incorrectly reconstructed.

event-store-append-is-not-crash-consistent [IN] OBSERVATION

Event store appends update in-memory state before persisting to disk with no rollback on failure, batch appends can leave partial writes, and the per-call file open/close pattern adds overhead without providing atomicity guarantees.

event-store-load-ignores-partial-batches [IN] OBSERVATION

EventStore.loadfromfile replays every NDJSON line on startup with no mechanism to detect or discard incomplete batches left by a mid-batch crash, unlike the WAL which uses OPCOMMIT records as transaction boundaries.

Source: entries/2026/05/29/topic-batch-atomicity-gap.md

event-store-memory-ahead-of-disk [IN] OBSERVATION

Event store appends events to the in-memory events list before calling persist_event, with no rollback on write failure — a crash or I/O error leaves the in-memory state ahead of disk with no mechanism to reconcile.

Source: entries/2026/05/29/topic-crash-recovery-invariants.md

event-store-ndjson-could-resync-but-doesnt [IN] OBSERVATION

The event store uses newline-delimited JSON where per-line resync is architecturally possible, but loadfrom_file halts on the first json.JSONDecodeError rather than skipping the bad line

Source: entries/2026/05/29/topic-length-prefix-framing-resilience.md

event-store-no-explicit-durability [IN] OBSERVATION

event-sourcing-store/eventstore.py:persist_event relies solely on Python's context manager close() for implicit flush() with no explicit flush() or fsync(), providing no durability guarantee against either process crash or power loss.

Source: entries/2026/05/29/topic-bitcask-durability-tradeoffs.md

event-store-optimistic-concurrency [IN] OBSERVATION

EventStore.append() accepts an expected_version parameter and rejects appends when the stream's current version doesn't match, implementing optimistic concurrency control for event streams

Source: entries/2026/05/29/topic-ddia-ch12-unbundling.md

event-store-optimistic-concurrency-via-expected-version [IN] OBSERVATION

EventStore.append takes an expected_version parameter for optimistic concurrency on stream appends, rejecting writes when the stream has advanced past the caller's version

Source: entries/2026/05/29/topic-ddia-unbundling-concept.md

event-store-persist-no-durability [IN] OBSERVATION

EventStore.persistevent() in event-sourcing-store/event_store.py opens the file in a with block, writes JSON, and relies on implicit close() — no explicit flush() or os.fsync(), making persisted events vulnerable to loss on OS crash

Source: entries/2026/05/29/topic-os-fsync-durability.md

event-store-persist-no-flush-no-fsync [IN] OBSERVATION

EventStore.persistevent opens the NDJSON file in append mode, writes one JSON line, and closes it without calling flush() or os.fsync(); written data may remain in OS buffers or Python buffers at crash time.

Source: entries/2026/05/29/topic-batch-atomicity-gap.md

event-store-persist-pattern [IN] OBSERVATION

EventStore in event-sourcing-store/eventstore.py is the repo's existing model for optional disk persistence: accept persistpath in constructor, call loadfromfile on init if the file exists, call persist_event on every mutation

Source: entries/2026/05/29/topic-snapshot-disk-persistence.md

event-store-persist-separate-open-close [IN] OBSERVATION

EventStore.persistevent performs a separate open()/write()/close() cycle for each individual event, so append_batch with N events results in N independent file operations rather than a single buffered write.

Source: entries/2026/05/29/topic-batch-atomicity-gap.md

event-store-single-threaded-assumption [IN] OBSERVATION

EventStore assumes single-threaded access: eventid = len(self.events) + 1 is a race condition under concurrency, and no locking protects events, streams, or subscriber notification.

Source: entries/2026/05/29/event-sourcing-store-eventstore-appendbatch.md

event-store-single-writer-assumption [IN] OBSERVATION

The event store's expectedversion check-then-act pattern in append and appendbatch has no locking, making the optimistic concurrency guard safe only under a single-writer concurrency model — concurrent writers create a TOCTOU race

Source: entries/2026/05/29/topic-event-sourcing-concurrency-model.md

expiration-cutoff-is-one-sided-despite-symmetric-matching [IN] OBSERVATION

expireevents uses watermark - duration as a one-sided cutoff for buffer cleanup, while contains() is symmetric; this is correct because future events can only arrive at or after the watermark

Source: entries/2026/05/29/topic-sliding-vs-tumbling-windows.md

external-sort-is-transparent [IN] OBSERVATION

Sort with memory_limit produces the same output ordering as an in-memory sort — the external merge-sort is a pure implementation optimization with no semantic difference.

Source: entries/2026/05/29/batch-word-count-test_pipeline.md

fdatasync-never-used [IN] OBSERVATION

No implementation uses os.fdatasync(), missing a safe optimization for append-only WALs where only data (not metadata like mtime) needs to reach disk.

Source: entries/2026/05/29/topic-fsync-flush-distinction.md

fdatasync-not-available-on-darwin [IN] OBSERVATION

Python's os.fdatasync exists only on Linux, not macOS/Darwin; any switch from os.fsync requires a platform fallback via getattr(os, 'fdatasync', os.fsync) to work on the current development environment

Source: entries/2026/05/29/topic-fdatasync-vs-fsync-optimization.md

fdatasync-safe-for-all-append-paths [IN] OBSERVATION

Every file opened in append mode ("ab") in this codebase writes sequentially with monotonically increasing file size, making fdatasync a safe drop-in replacement for fsync on those paths (WAL, Bitcask data files, B-tree WAL)

Source: entries/2026/05/29/topic-fdatasync-vs-fsync-optimization.md

fenced-server-rejects-strictly-lower-tokens [IN] OBSERVATION

FencedResourceServer.write rejects writes where fencing_token < highest (strict less-than, not less-than-or-equal), meaning same-token retries succeed — enabling idempotent writes with the same lock acquisition

Source: entries/2026/05/29/topic-ddia-ch8-process-pauses.md

fencing-no-exceptions-on-denial [IN] OBSERVATION

All denial conditions in fencing tokens (lock held, stale token, wrong client) return sentinel values (None, False, or error dicts) rather than raising exceptions, modeling distributed system responses

Source: entries/2026/05/29/fencing-tokens-fencing_tokens.md

fencing-rejects-stale-writes [IN] OBSERVATION

FencedResourceServer.write() rejects any write with a fencing token strictly less than the highest token previously seen for that resource; equal tokens are accepted

Source: entries/2026/05/29/fencing-tokens-fencing_tokens.md

fencing-token-counter-is-global [IN] OBSERVATION

The fencing token counter is global across all locks, not per-lock; tokens issued for different locks are comparable and strictly ordered by acquisition time

Source: entries/2026/05/29/fencing-tokens-testfencingtokens.md

fencing-token-monotonic [IN] OBSERVATION

LockService._counter starts at 1, increments by 1 on every successful acquire(), and is never decremented or reset, guaranteeing strict monotonicity of issued tokens

Source: entries/2026/05/29/fencing-tokens-fencing_tokens.md

fencing-token-tracking-is-per-resource [IN] OBSERVATION

Token high-water marks on FencedResourceServer are scoped per resource identifier, so a lower token can succeed on a different resource than the one that saw a higher token

Source: entries/2026/05/29/fencing-tokens-testfencingtokens.md

fencing-tokens-do-not-expire-at-resource-server [IN] OBSERVATION

The FencedResourceServer has no TTL or expiration logic for tokens — once a token number is seen, all lower tokens are permanently rejected for that resource

Source: entries/2026/05/29/topic-ddia-ch8-process-pauses.md

find-conflicts-detects-concurrency [IN] OBSERVATION

find_conflicts returns True if and only if at least two VersionedValue entries in the input list have concurrent (mutually non-dominating) vector clocks

Source: entries/2026/05/29/vector-clocks-testvectorclock.md

find-leaf-bisect-right-consistency [IN] OBSERVATION

findleaf, search, and insert all use bisectright (not bisectleft) for child-pointer routing, ensuring all three methods agree on which leaf owns a given key; this is consistent with the split strategy where the separator key equals the first key of the right sibling.

Source: entries/2026/05/29/b-tree-storage-engine-btree-findleaf.md

find-leaf-no-type-check [IN] OBSERVATION

findleaf does not verify that a page is an internal node before calling deserializeinternal; a corrupted tree height or misidentified page type produces silently wrong results rather than an error.

Source: entries/2026/05/29/b-tree-storage-engine-btree-findleaf.md

find-leaf-range-scan-only [IN] OBSERVATION

findleaf is called exclusively by rangescan; point lookups use search instead, which reads the leaf inline and returns the value rather than the page number.

Source: entries/2026/05/29/b-tree-storage-engine-btree-findleaf.md

first-committer-wins-on-write-conflict [IN] OBSERVATION

When two concurrent transactions write the same key, the first to call commit() succeeds and the second is aborted automatically

Source: entries/2026/05/29/topic-mvcc-snapshot-isolation.md

fixture-usage-is-sparse [IN] OBSERVATION

Only 3 of the test*.py files use @pytest.fixture (testcdc, testmapsidejoins, testbitcask); most pytest files still use inline setup identical to their tester counterparts

Source: entries/2026/05/29/topic-tester-vs-pytest-test-duality.md

flush-before-fsync-invariant [IN] OBSERVATION

Every os.fsync() call in the codebase is immediately preceded by a .flush() call, ensuring Python's userspace buffer is drained to the kernel page cache before requesting stable storage persistence.

Source: entries/2026/05/29/topic-fsync-vs-flush-semantics.md

flush-clears-then-appends [IN] OBSERVATION

flush() both clears self.memtable and appends to self._sstables, creating a window where data exists in neither location if a concurrent reader checks between the two mutations

Source: entries/2026/05/28/topic-lsm-concurrency-safety.md

flush-creates-sequential-sstables [IN] OBSERVATION

Each flush call creates an SSTable file with a strictly increasing sequence number (zero-padded filename for lexicographic = numeric sort), preserving newest-last ordering in self.sstables

Source: entries/2026/05/29/log-structured-merge-tree-lsm-_flush.md

flush-index-drains-pending [IN] OBSERVATION

flushindex() is the sole mechanism that materializes async index updates: it iterates self.pending, calls applyindex_op() for each entry, and returns the count of operations applied

Source: entries/2026/05/29/topic-async-index-consistency.md

flush-skips-immutable-memtable-stage [IN] OBSERVATION

flush writes the frozen memtable directly to an SSTable without staging it in immutablememtables, creating a brief window where in-flight keys are invisible to get() despite immutable_memtables being checked in the read path

Source: entries/2026/05/29/log-structured-merge-tree-lsm-_flush.md

force-true-bypasses-sync-mode [IN] OBSERVATION

Passing force=True to dosync() causes flush+fsync regardless of the configured sync mode, enabling group commit at batch boundaries

Source: entries/2026/05/29/topic-fsync-semantics-by-mode.md

force-true-used-at-rotation-and-close [IN] OBSERVATION

The only dosync(force=True) call sites (lines 165 and 175) correspond to segment rotation and WAL close, both operations where proceeding without a flush would risk data loss or file corruption

Source: entries/2026/05/29/topic-sync-mode-none-safety.md

format-and-structure-prevent-all-repair [IN] OBSERVATION

The system can neither self-heal during operation nor be evolved to add self-healing capabilities: runtime degradation is permanent (leaked pages accumulate, tree height only grows, no rebalancing occurs, crash recovery has no safe path), AND the rigid binary formats across the entire storage stack prevent adding recovery mechanisms such as resync points, version negotiation, or structural checksums.

format-rigidity-prevents-evolutionary-repair [IN] OBSERVATION

The system cannot evolve its way out of known corruption vulnerabilities: the rigid binary format design across the entire storage stack prevents forward evolution and post-corruption recovery (no block alignment, no version fields, no extensibility), and the system simultaneously lacks defense-in-depth against the corruption these formats cannot recover from (no input validation, no resync capability).

free-list-head-persisted-in-page-zero [IN] OBSERVATION

The free list head pointer is stored as the fifth field of the metadata page (page 0) in the layout [rootpage:4B][height:4B][totalkeys:4B][nextfreepage:4B][freelisthead:4B], making it durable across restarts.

Source: entries/2026/05/29/topic-intrusive-free-list-pattern.md

free-list-is-lifo [IN] OBSERVATION

The page free list is LIFO: freepage pushes to the head and allocatepage pops from the head, so the most recently freed page is reused first.

Source: entries/2026/05/29/b-tree-storage-engine-btree-free_page.md

free-list-is-lifo-stack [IN] OBSERVATION

The B-tree's free page list is a LIFO stack: freepage pushes onto the head and allocatepage pops from the head, so pages are reused in reverse order of when they were freed.

Source: entries/2026/05/29/topic-free-list-fragmentation.md

free-list-layout-coupled [IN] OBSERVATION

allocatepage and freepage share an implicit contract on free-list node layout (3-byte zeroed header + 4-byte big-endian next-pointer); neither validates the other's output, so a format change in one silently breaks the other

Source: entries/2026/05/28/b-tree-storage-engine-btree-allocate_page.md

free-list-lifo-decorrelates-layout [IN] OBSERVATION

LIFO free-list reuse means delete-then-reinsert cycles progressively decorrelate logical key order from physical page order, defeating sequential I/O readahead for range scans.

Source: entries/2026/05/29/topic-free-list-fragmentation.md

free-page-only-called-from-delete [IN] OBSERVATION

PageManager.freepage is only invoked by BTree.delete when removing an emptied leaf node; no other code path frees pages.

Source: entries/2026/05/29/b-tree-storage-engine-btree-free_page.md

free-page-overwrites-content [IN] OBSERVATION

PageManager.free_page() overwrites the freed page's content with a zero header and a free-list pointer, destroying the original leaf/internal node data so freed pages cannot be accidentally read as valid tree nodes.

Source: entries/2026/05/29/topic-free-list-reuse-under-churn.md

free-page-stores-only-seven-bytes [IN] OBSERVATION

Each free-list page occupies a full page (default 4096 bytes) but stores only 7 bytes of useful data (3-byte zero header + 4-byte next pointer); the remaining bytes are zero padding from write_page.

Source: entries/2026/05/29/topic-free-list-fragmentation.md

fresh-sstables-always-start-at-level-zero [IN] OBSERVATION

SSTableWriter.finish() hardcodes level=0 (sstable.py:104), matching LevelDB's invariant that flushed memtables always produce L0 files

Source: entries/2026/05/29/topic-leveldb-version-set.md

fsync-policy-inconsistent-across-critical-paths [IN] OBSERVATION

The codebase has a systematic fsync policy inconsistency between components: WAL critical operations (checkpoints, batch commits, segment rotations) always force-fsync regardless of configured mode, but B-tree metadata mutations (which are equally critical for crash recovery) skip fsync entirely despite paying double fsync for user data pages.

fsync-used-only-for-appends [IN] OBSERVATION

os.fsync appears in the WAL append path and Bitcask record writes but never in SSTable creation, compaction output, or hint file writes — durability is applied to the append hot path only

Source: entries/2026/05/29/topic-write-to-temp-then-rename-pattern.md

full-data-lifecycle-unsafe-from-write-through-read [IN] OBSERVATION

The complete data lifecycle is unsafe from storage maintenance through data retrieval: compaction is the highest-risk operation that can permanently lose or resurrect deleted data, and the read path from SSTable through CDC to derived systems is unreliable at every stage, meaning data is at risk whether it is being reorganized for efficiency or being served to consumers.

full-stack-restart-fragility [IN] OBSERVATION

Both the physical storage layer and the logical transaction layer are independently fragile under restart, creating a full-stack restart hazard: the B-tree's durability model protects user data pages but not structural metadata (height, sibling chain, free list), while the transaction system's isolation model depends on monotonic counters and abort-as-status-change semantics that have no persistence backing.

gc-no-active-keeps-one [IN] OBSERVATION

When no transactions are active, garbage_collect() retains at most one version per key — the latest committed non-deleted version — and drops everything else including fully-deleted keys.

Source: entries/2026/05/29/snapshot-isolation-mvccdatabase-garbagecollect.md

gc-not-thread-safe [IN] OBSERVATION

garbagecollect() mutates versions (replacing lists and deleting keys) without synchronization, assuming single-threaded execution — concurrent reads or writes during GC would race on the dict and its lists.

Source: entries/2026/05/29/snapshot-isolation-mvccdatabase-garbagecollect.md

gc-preserves-active-snapshots [IN] OBSERVATION

Garbage collection removes only versions unreachable by any active transaction, ensuring long-running read-only transactions see consistent data

Source: entries/2026/05/29/topic-mvcc-snapshot-isolation.md

gc-unconditionally-purges-aborted [IN] OBSERVATION

garbagecollect() removes all versions with createdby in _aborted as its first step, regardless of active transaction state — aborted versions are invisible to everyone and always safe to drop.

Source: entries/2026/05/29/snapshot-isolation-mvccdatabase-garbagecollect.md

gc-uses-min-txid-horizon [IN] OBSERVATION

With active transactions, garbagecollect() uses min(active txids) as the GC horizon; only versions superseded by a committed version with txid < mintx_id are eligible for removal.

Source: entries/2026/05/29/snapshot-isolation-mvccdatabase-garbagecollect.md

get-pending-drains-atomically [IN] OBSERVATION

getpendingchanges returns the current _pending list and replaces it with an empty list, ensuring each change is delivered exactly once per replication cycle

Source: entries/2026/05/29/topic-ring-topology-propagation.md

gossip-bandwidth-scales-linearly-per-message [IN] OBSERVATION

Each gossip round transmits the entire membership list per exchange despite a fanout of only one peer per node, making per-message payload O(N) in cluster size and total per-round bandwidth O(N squared) across the cluster, even though only N point-to-point messages are sent.

gossip-carries-membership-not-data [IN] OBSERVATION

GossipNode.receive_gossip merges heartbeat counters and node status (alive/suspected/dead) only; it never exchanges application key-value data, making it a SWIM-style failure detector rather than a data replication protocol

Source: entries/2026/05/29/topic-dynamo-anti-entropy.md

gossip-cleanup-bounds-membership-growth [IN] OBSERVATION

Dead nodes are removed from the membership list after t_cleanup elapsed time (default 20), preventing unbounded growth of the membership table from accumulated failure records

Source: entries/2026/05/29/topic-swim-protocol.md

gossip-cleanup-removes-entry-entirely [IN] OBSERVATION

After t_cleanup elapses, a dead node's record is fully deleted from membership (not just flagged), which enables clean rejoin with a fresh identity

Source: entries/2026/05/29/gossip-protocol-testgossipprotocol.md

gossip-cluster-uses-logical-time [IN] OBSERVATION

The gossip simulation uses explicit logical timestamps passed as parameters (not wall-clock time), with a fixed RNG seed for deterministic gossip partner selection, following the same simulated-time pattern as hinted-handoff

Source: entries/2026/05/29/gossip-protocol-testgossipprotocol.md

gossip-convergence-is-olog-n [IN] OBSERVATION

Full membership convergence occurs within O(log N) gossip rounds, empirically bounded by 5 * log₂(N) + 5 rounds in the test suite

Source: entries/2026/05/29/gossip-protocol-testgossipprotocol.md

gossip-dead-node-lifecycle-is-comprehensive [IN] OBSERVATION

Dead nodes in the gossip protocol follow a comprehensive, irreversible lifecycle with three independent safeguards: death status cannot be reversed by incoming gossip messages, dead node records are fully removed from the membership list after the cleanup interval (not merely flagged), and incoming gossip about already-dead nodes from other peers is silently filtered to prevent zombie reintroduction through stale state.

gossip-dead-nodes-filtered-on-receive [IN] OBSERVATION

When receiving gossip, unknown nodes that arrive with dead status are silently dropped to prevent zombie membership entries from propagating through the cluster

Source: entries/2026/05/29/gossip-protocol-gossip_protocol.md

gossip-death-is-irreversible-via-gossip [IN] OBSERVATION

Once a node's status is dead, receiving a gossip message with status: alive for that node will never revert it — the merge logic at line 67 checks local["status"] != "dead" before allowing exoneration

Source: entries/2026/05/29/topic-swim-protocol.md

gossip-deep-copy-isolation [IN] OBSERVATION

All inter-node data transfer (sendgossip, join, getmembership_list) uses copy.deepcopy to prevent shared mutable state between simulated nodes

Source: entries/2026/05/29/gossip-protocol-gossip_protocol.md

gossip-detect-failures-is-stateless [IN] OBSERVATION

detectfailures makes decisions using only currenttime - timestamplastupdated; it consults no historical distribution or sliding window, making it a pure point-in-time comparison rather than a statistical inference.

Source: entries/2026/05/29/topic-phi-accrual-failure-detector.md

gossip-failure-detection-gates-replication [IN] OBSERVATION

GossipNode.getalivemembers returns the liveness set that anti-entropy, read-repair, and hinted-handoff all depend on to decide which nodes to contact — gossip provides the membership layer that every data-exchange layer needs

Source: entries/2026/05/29/topic-dynamo-anti-entropy.md

gossip-failure-detection-governs-cluster-correctness [IN] OBSERVATION

Gossip-based failure detection is the single correctness bottleneck for the distributed cluster: replication, read repair, and hinted handoff all depend on its timeout-driven liveness set, which is bounded by cleanup to prevent unbounded membership growth.

gossip-failure-detection-lacks-adaptive-accuracy [IN] OBSERVATION

Gossip failure detection is both the single correctness bottleneck for the cluster AND permanently miscalibrated: it governs all replication, read repair, and hinted handoff decisions, yet stores only the most recent timestamp per peer with no arrival history, making it unable to adapt detection thresholds to actual network conditions as phi-accrual detection would require.

gossip-fanout-is-one [IN] OBSERVATION

Each node selects exactly one random peer per gossip round, with bidirectional exchange producing at most N pairwise syncs per round.

Source: entries/2026/05/29/topic-topology-propagation-bounds.md

gossip-fixed-three-threshold-fsm [IN] OBSERVATION

Failure detection uses exactly three fixed time thresholds (tsuspect=5, tdead=10, t_cleanup=20) producing a deterministic state machine with no runtime adaptation to network conditions.

Source: entries/2026/05/29/topic-phi-accrual-failure-detector.md

gossip-merge-uses-heartbeat-monotonicity [IN] OBSERVATION

receive_gossip only accepts remote state when the remote heartbeat counter strictly exceeds the local counter, except for death notifications at equal counters which are also accepted

Source: entries/2026/05/29/gossip-protocol-gossip_protocol.md

gossip-no-arrival-history [IN] OBSERVATION

Each node stores only the most recent timestamplastupdated per peer, discarding all inter-arrival time history that would be needed for probabilistic failure detection (e.g., phi accrual).

Source: entries/2026/05/29/topic-phi-accrual-failure-detector.md

gossip-node-status-lifecycle [IN] OBSERVATION

Node status follows alive → suspected → dead → removed with configurable timeouts tsuspect, tdead, t_cleanup governing transitions; a suspected node can return to alive if a higher heartbeat counter arrives

Source: entries/2026/05/29/gossip-protocol-gossip_protocol.md

gossip-suspicion-is-timeout-based [IN] OBSERVATION

Suspicion in the gossip protocol is triggered by elapsed time exceeding t_suspect since the last heartbeat update, not by failed probe responses as in the full SWIM paper's ping/indirect-ping protocol

Source: entries/2026/05/29/topic-swim-protocol.md

gossip-topology-is-fully-connected [IN] OBSERVATION

GossipCluster uses unrestricted random peer selection (any node can gossip with any other), making the effective topology a fully-connected graph.

Source: entries/2026/05/29/topic-topology-propagation-bounds.md

gossip-uniform-thresholds [IN] OBSERVATION

All nodes in a GossipCluster receive identical threshold values from the cluster constructor at _init_ (line 117); individual nodes cannot calibrate sensitivity to their specific network path.

Source: entries/2026/05/29/topic-phi-accrual-failure-detector.md

gossip-uses-full-membership-exchange [IN] OBSERVATION

Each gossip round transmits the entire membership list via deepcopy rather than deltas or piggybacked updates, making per-exchange bandwidth cost O(N) in cluster size instead of O(1) amortized

Source: entries/2026/05/29/topic-swim-protocol.md

gossip-voluntary-leave-broadcasts-to-all [IN] OBSERVATION

A leaving node sends its death status to every active peer (not just a random one) before deactivating, unlike normal gossip which uses random pairwise exchange

Source: entries/2026/05/29/gossip-protocol-gossip_protocol.md

handlers-must-mutate-state-in-place [IN] OBSERVATION

Both reconstruct_state and Projection expect handler functions to mutate the state dict in place rather than returning a new value; a handler that returns without mutating silently loses its changes

Source: entries/2026/05/29/event-sourcing-store-eventstore-reconstructstate.md

happens-before-requires-two-reaches-calls [IN] OBSERVATION

happensbefore calls reaches in both directions to distinguish "a before b", "b before a", and "concurrent" — a single call can only confirm or deny one direction.

Source: entries/2026/05/29/lamport-clocks-lamport-_reaches.md

hash-index-all-keys-in-memory [IN] OBSERVATION

Both Bitcask implementations keep every live key in an in-memory dict (keydir / _index), meaning the key set must fit in RAM; there is no disk-based fallback for partial index spill.

Source: entries/2026/05/29/topic-ddia-chapter3-hash-indexes.md

hash-index-bitcask-no-checksum [IN] OBSERVATION

The hash-index-storage Bitcask records contain no CRC or checksum field, unlike the log-structured-hash-table implementation; on-disk corruption is undetectable during reads or compaction.

Source: entries/2026/05/29/hash-index-storage-bitcask-writerecord.md

hash-index-bitcask-shared-read-handles [IN] OBSERVATION

hash-index-storage/bitcask.py uses a single cached file handle per segment for all reads via getreader(), making concurrent reads to the same segment unsafe due to shared seek position.

Source: entries/2026/05/29/topic-reference-counted-file-handles.md

hash-index-compaction-manual-only [IN] OBSERVATION

The hash-index-storage/bitcask.py compact() must be called explicitly by the caller; there is no auto-compact trigger, threshold tracking, or background compaction unlike the log-structured-hash-table variant

Source: entries/2026/05/29/topic-bitcask-paper-comparison.md

hash-index-get-tombstone-guard [IN] OBSERVATION

hash-index-storage/bitcask.py's get() checks if value == "": return None, providing a defense-in-depth layer against tombstone leaks that log-structured-hash-table's get() lacks

Source: entries/2026/05/29/topic-tombstone-handling-in-hint-files.md

hash-index-hint-self-sufficient [IN] OBSERVATION

The hash-index-storage hint file contains all four keydir fields (file_id, offset, size, timestamp), making it sufficient to fully reconstruct the in-memory index without reading data files

Source: entries/2026/05/29/topic-bitcask-hint-file-format.md

hash-index-is-memory-bound-by-design [IN] OBSERVATION

Hash index storage is fundamentally memory-bound: every key must reside in RAM for the single-seek O(1) read path, making dataset size directly constrained by available memory with no spill-to-disk fallback.

hash-index-keys-must-fit-in-ram [IN] OBSERVATION

Both Bitcask implementations require every live key to be held in a Python dict in memory; the dataset's key space is bounded by available RAM

Source: entries/2026/05/29/topic-bitcask-vs-lsm-tree.md

hash-index-no-crc-by-design [IN] OBSERVATION

hash-index-storage/bitcask.py intentionally omits CRC to focus on keydir/append-log/compaction; log-structured-hash-table/bitcask.py provides the integrity-checking variant with zlib.crc32 and CorruptionError

Source: entries/2026/05/29/topic-hash-index-bitcask-no-crc.md

hash-index-no-range-queries [IN] OBSERVATION

Neither hash index implementation (hash-index-storage/bitcask.py or log-structured-hash-table/bitcask.py) provides a scan, range, or ordered-iteration method; only exact-key point lookups are supported.

Source: entries/2026/05/29/topic-ddia-chapter3-hash-indexes.md

hash-index-read-is-single-seek [IN] OBSERVATION

A Bitcask get() does one dict lookup plus one positioned disk read (O(1)), while an LSM-tree get() may search the memtable then multiple SSTables from newest to oldest (O(log N) per level)

Source: entries/2026/05/29/topic-bitcask-vs-lsm-tree.md

hash-index-sync-writes-controls-fsync [IN] OBSERVATION

In hash-index-storage Bitcask, the syncwrites flag controls whether each write_record call fsyncs to disk; when False, recently written records may be lost on crash.

Source: entries/2026/05/29/hash-index-storage-bitcask-writerecord.md

hash-index-wall-clock-not-monotonic [IN] OBSERVATION

writerecord timestamps use time.time(), which is not monotonic — NTP adjustments or manual clock changes can cause a newer write to carry an older timestamp, confusing compaction's latest-wins conflict resolution.

Source: entries/2026/05/29/hash-index-storage-bitcask-writerecord.md

hash-index-write-no-index-mutation [IN] OBSERVATION

writerecord in hash-index-storage/bitcask.py does not modify keydir; the caller (put or delete) is responsible for updating the in-memory index after the write returns.

Source: entries/2026/05/29/hash-index-storage-bitcask-writerecord.md

hash-mod-destroys-key-order [IN] OBSERVATION

Hash-mod partitioning (hash(k) % num_reducers) scatters lexicographically adjacent keys across different partitions, making range queries require a full scatter-gather across all reducers; MapReduce run() re-sorts the final results to compensate

Source: entries/2026/05/29/topic-hash-partitioning-skew.md

hash-mod-partition-count-is-static [IN] OBSERVATION

In mapreduce.py, numreducers is fixed at job creation (numreducers: int = 2) and never changes during execution; all partition assignments are determined by hash(k) % num_reducers with no dynamic rebalancing

Source: entries/2026/05/29/topic-hash-partitioning-skew.md

hash-mod-skew-shared-across-modules [IN] OBSERVATION

Three modules use the same hash(k) % numpartitions pattern with fixed partition counts and no hot-key mitigation: mapreduce.py:109, secondaryindexpartitioning.py:56, and partitionedlog.py:149 — all share the straggler vulnerability where a single hot key overwhelms one partition

Source: entries/2026/05/29/topic-hash-partitioning-skew.md

hashlib-reserved-for-crypto-distribution [IN] OBSERVATION

The codebase consistently uses zlib.crc32 for corruption detection in storage engines and hashlib (SHA-256) for content addressing (Merkle trees), uniform distribution (consistent hashing, bloom filters), and cryptographic integrity (BFT) — the two hash families never cross purposes

Source: entries/2026/05/29/topic-crc32-vs-xxhash-choice.md

header-corruption-derails-sequential-scan [IN] OBSERVATION

A corrupted keysize or valsize in hash-index-storage/bitcask.py causes scandata_file to read wrong byte boundaries, potentially misinterpreting all subsequent records in that file

Source: entries/2026/05/29/topic-hash-index-bitcask-no-crc.md

heap-tiebreak-uses-source-index [IN] OBSERVATION

The merge heap tuple includes a source_index field solely to prevent Python from comparing SSTableEntry objects when (key, -timestamp) ties, which would raise TypeError.

Source: entries/2026/05/28/sstable-and-compaction-sstable-merge_sstables.md

hint-corrupt-worse-than-missing [IN] OBSERVATION

A corrupted hint file silently loads bad offsets into the index with no validation; a missing hint file triggers a CRC-checked full scan via scansegment, making hint corruption strictly worse than hint absence.

Source: entries/2026/05/29/log-structured-hash-table-bitcask-loadhint_file.md

hint-expiry-bounds-durability [IN] OBSERVATION

Hints in HintedHandoffStore have a TTL; if triggerhandoff doesn't run before createdat + hintttl, expirehints silently drops the hint, permanently losing that replica's copy of the data

Source: entries/2026/05/29/topic-sloppy-quorum-tradeoffs.md

hint-file-no-fsync [IN] OBSERVATION

Hint files are written without fsync; they are a best-effort startup optimization that can be lost on crash without data loss, since the data file remains authoritative.

Source: entries/2026/05/28/hash-index-storage-bitcask-writehint_file.md

hint-file-no-integrity-check [IN] OBSERVATION

Neither the hint file writer nor reader performs checksum validation; a truncated or corrupted hint file causes a struct.error or IndexError on startup rather than a graceful fallback to data file scanning.

Source: entries/2026/05/28/hash-index-storage-bitcask-writehint_file.md

hint-file-orphan-crash-risk [IN] OBSERVATION

In hash-index-storage/bitcask.py, a crash between deleting a data file and its hint file leaves an orphaned hint that directs reads to a nonexistent data file, causing hard failures on get().

Source: entries/2026/05/28/topic-bitcask-compaction-atomicity.md

hint-file-write-not-atomic [IN] OBSERVATION

Hint files are written directly to the final path (not via temp-file-and-rename), so a crash mid-generation leaves a partial hint file; loadhint_file handles this by stopping at the first short read

Source: entries/2026/05/29/log-structured-hash-table-bitcask-createhintfiles.md

hint-files-are-optimization-only [IN] OBSERVATION

Hint files accelerate index rebuild during recovery but carry no crash-consistency guarantees; they are not consulted to determine which data files constitute the valid current state

Source: entries/2026/05/28/topic-bitcask-crash-recovery.md

hint-files-only-canonical-entries [IN] OBSERVATION

A hint entry is written only when self._index[key] confirms the key's canonical location is the current segment and offset, filtering out entries superseded by later writes or compaction

Source: entries/2026/05/29/log-structured-hash-table-bitcask-createhintfiles.md

hint-format-diverges-between-implementations [IN] OBSERVATION

Hash-index hint uses a 24-byte fixed header per entry (fileid:u32, offset:u64, size:u32, timestamp:f64) while log-structured hint uses only an 8-byte header (keysize:u32, offset:u32), omitting file_id, timestamp, and record size

Source: entries/2026/05/29/topic-bitcask-startup-cost.md

hint-format-symmetry [IN] OBSERVATION

writehintfile and loadhintfile must use identical binary layouts: [HINTFORMAT header (24 bytes)][keylen uint32 (4 bytes)][key_bytes (variable)] per entry; any change to one without the other breaks backward compatibility.

Source: entries/2026/05/28/hash-index-storage-bitcask-writehint_file.md

hint-format-variable-length [IN] OBSERVATION

Each hint record is HINTHEADERSIZE (24 bytes) + 4 (keysize uint32) + keylength bytes; the HINT_FORMAT constant covers only the fixed 24-byte portion, not the full record.

Source: entries/2026/05/29/hash-index-storage-bitcask-loadhint_file.md

hint-generation-no-crc-validation [IN] OBSERVATION

createhintfiles does not verify CRC checksums on the segment records it reads (unlike scansegment), so corrupted data can be silently indexed via hint files

Source: entries/2026/05/29/log-structured-hash-table-bitcask-createhintfiles.md

hint-generation-skips-active-segment [IN] OBSERVATION

createhintfiles never generates a hint file for the active segment, only for frozen (immutable) segments, since a hint for the active segment would be immediately stale

Source: entries/2026/05/29/log-structured-hash-table-bitcask-createhintfiles.md

hint-offset-is-segment-offset [IN] OBSERVATION

The offset value stored in a hint entry is a byte position in the corresponding .dat segment file, not a position within the hint file itself; get() uses this offset to seek directly into the segment.

Source: entries/2026/05/29/log-structured-hash-table-bitcask-loadhint_file.md

hint-written-only-during-compaction [IN] OBSERVATION

writehint_file is called exclusively from compact(), never during normal put/delete operations; hint files only exist for compacted (merged) data files.

Source: entries/2026/05/28/hash-index-storage-bitcask-writehint_file.md

hinted-handoff-hint-cleanup-all-or-nothing [IN] OBSERVATION

removehintsfor() purges all hints for a recovered target node regardless of whether individual hints were successfully delivered or had expired; after trigger_handoff, no hints for that node remain on any other node.

Source: entries/2026/05/29/hinted-handoff-hinted_handoff.md

hinted-handoff-hint-loss-is-silent-data-loss [IN] OBSERVATION

If a hint node crashes and its hints are lost before handoff, the target node never receives the data and no error is raised — requiring anti-entropy repair (e.g., Merkle trees) to detect and fix.

Source: entries/2026/05/29/hinted-handoff-testhintedhandoff.md

hinted-handoff-hint-ttl-exclusive-boundary [IN] OBSERVATION

Hint expiration uses exclusive comparison: a hint created at time t with TTL d survives at t + d - 1 but is expired at t + d.

Source: entries/2026/05/29/hinted-handoff-testhintedhandoff.md

hinted-handoff-no-exceptions [IN] OBSERVATION

Write failures are signaled via success: False in the return dict; the hinted handoff module never raises exceptions in normal operation.

Source: entries/2026/05/29/hinted-handoff-hinted_handoff.md

hinted-handoff-one-hint-per-target-per-write [IN] OBSERVATION

Each unavailable preferred replica gets at most one hint stored on one non-preferred node per write operation, preventing hint explosion.

Source: entries/2026/05/29/hinted-handoff-hinted_handoff.md

hinted-handoff-reads-skip-non-preferred [IN] OBSERVATION

get() only queries preferred replicas, so data existing only as hints on non-preferred nodes is invisible to reads until handoff delivers it.

Source: entries/2026/05/29/hinted-handoff-hinted_handoff.md

hinted-handoff-sloppy-quorum-counts-hints [IN] OBSERVATION

When sloppyquorum=True (the default), stored hints count toward writequorum the same as direct replica writes, trading consistency for write availability.

Source: entries/2026/05/29/hinted-handoff-hinted_handoff.md

hinted-handoff-sloppy-quorum-writes-to-non-preferred [IN] OBSERVATION

When sloppy quorum is enabled and preferred replicas are down, writes succeed by routing to non-preferred substitute nodes that store hints for later delivery.

Source: entries/2026/05/29/hinted-handoff-testhintedhandoff.md

hinted-handoff-strict-quorum-rejects-insufficient-preferred [IN] OBSERVATION

With sloppyquorum=False, a write fails if fewer than writequorum preferred nodes are available, even if non-preferred nodes could serve.

Source: entries/2026/05/29/hinted-handoff-testhintedhandoff.md

hinted-handoff-time-is-parameter [IN] OBSERVATION

All time-dependent operations (put, triggerhandoff, expireallhints) take currenttime as an explicit argument rather than calling time.time(), making the implementation fully deterministic and testable.

Source: entries/2026/05/29/hinted-handoff-hinted_handoff.md

hinted-handoff-ttl-bounds-hint-lifetime [IN] OBSERVATION

Hints expire after createdat + ttl (checked via Hint.isexpired); a node that stays down longer than the TTL window will not receive those hinted writes and must rely on anti-entropy for convergence.

Source: entries/2026/05/29/topic-anti-entropy-vs-read-repair.md

hinted-handoff-version-monotonic-per-key [IN] OBSERVATION

The coordinator assigns strictly increasing versions per key via a centralized counter, so stale hint replays are absorbed harmlessly by Node.put()'s version >= existing_version check.

Source: entries/2026/05/29/hinted-handoff-hinted_handoff.md

index-is-volatile [IN] OBSERVATION

All three storage engines rebuild their in-memory index (keydir, _index) entirely from on-disk files at startup; inconsistent file state after a crash produces a silently incomplete index with no error

Source: entries/2026/05/28/topic-bitcask-crash-recovery.md

init-meta-unprotected-by-wal [IN] OBSERVATION

During initial file creation, PageManager._init writes metadata and the root leaf page directly via write_meta without WAL protection; acceptable because no user data exists yet to lose.

Source: entries/2026/05/29/b-tree-storage-engine-btree-writemeta.md

input-validation-systematically-absent [IN] OBSERVATION

Input validation is systematically absent across configuration and API boundaries: the WAL accepts arbitrary strings as sync mode (silently disabling all durability guarantees), truncate accepts out-of-range sequence numbers (silently deleting all data), compaction strategy selection silently falls through to leveled on any unrecognized string, and Merkle proof direction is never validated — misconfiguration produces incorrect behavior rather than failing fast.

integrity-checking-is-both-incomplete-and-unrecoverable [IN] OBSERVATION

Data integrity checking fails in two complementary ways: CRC checksums exclude routing metadata (sequence numbers, page numbers, headers), so corruption in those fields goes undetected; and every reader halts at the first detected CRC failure with no resync capability — meaning both detectable and undetectable corruption are terminal.

integrity-degrades-along-storage-pipeline [IN] OBSERVATION

Data integrity verification degrades monotonically along the storage pipeline: WAL records have CRC checksums that exclude routing metadata (partial coverage), SSTables have no checksums at all (zero coverage), and neither layer can recover from detected corruption — data transitions from partially verified to completely unverified as it moves through compaction from WAL to SSTable.

internal-node-children-invariant [IN] OBSERVATION

serializeinternal assumes len(children) == len(keys) + 1 but does not assert it; too few children raises IndexError, while extra children beyond keys+1 are silently dropped — a potential data-loss path if the invariant is violated.

Source: entries/2026/05/29/b-tree-storage-engine-btree-serializeinternal.md

internal-nodes-no-sibling-pointer [IN] OBSERVATION

Internal nodes have no sibling pointer field; their binary layout is `type:1B][num_keys:2B][child₀:4B... with no right-link, and deserializeinternal returns a 2-tuple (keys, children) vs. the leaf's 3-tuple (keys, values, next_sibling)`.

Source: entries/2026/05/29/topic-internal-node-format.md

internal-page-binary-layout [IN] OBSERVATION

Internal page wire format is [type:1B][num_keys:2B][child0:4B] followed by repeating [keylen:2B][key:varB][child:4B], all big-endian with no alignment padding; this interleaved layout mirrors the logical separator-pointer structure.

Source: entries/2026/05/29/b-tree-storage-engine-btree-serializeinternal.md

iter-descends-left-spine [IN] OBSERVATION

_iter reaches the leftmost leaf by following children[0] at every internal level, then walks the sibling chain to NOSIBLING to enumerate all keys in sorted order

Source: entries/2026/05/29/topic-split-during-insert.md

iter-no-counter-reset [IN] OBSERVATION

Unlike get, put, delete, and rangescan, BTree.iter does not call resetcounters() before running, so I/O page-read stats accumulate on top of any prior count — a gotcha when benchmarking iteration.

Source: entries/2026/05/29/b-tree-storage-engine-btree-_iter_.md

iterate-flushes-without-fsync [IN] OBSERVATION

Before reading, iterate() calls fd.flush() but not os.fsync(), so buffered writes reach the OS page cache but are not guaranteed durable on stable storage.

Source: entries/2026/05/28/write-ahead-log-wal-iterate.md

iterate-not-snapshot-isolated [IN] OBSERVATION

iterate() releases the WAL lock after flushing but before reading begins, so concurrent appends may or may not be visible to the iterator depending on timing.

Source: entries/2026/05/28/write-ahead-log-wal-iterate.md

iterate-stops-silently-at-corruption [IN] OBSERVATION

When iterate() encounters a CRC mismatch, iteration stops with no exception or sentinel value — the caller receives all valid records before the corruption with no indication that more data existed.

Source: entries/2026/05/28/write-ahead-log-wal-iterate.md

iterate-yields-all-op-types [IN] OBSERVATION

iterate() yields records of every op type (PUT, DELETE, COMMIT, CHECKPOINT) while replay() filters to only PUT and DELETE records.

Source: entries/2026/05/28/write-ahead-log-wal-iterate.md

join-determinism-from-event-time [IN] OBSERVATION

The stream join processor's output is fully determined by the sequence of (streamname, key, value, timestamp) inputs and advancetime calls, with no dependency on wall-clock time — making it deterministically testable without clock mocking

Source: entries/2026/05/29/topic-watermark-vs-processing-time.md

join-uses-symmetric-interval-not-tumbling [IN] OBSERVATION

TimeWindow.contains() uses abs(t1 - t2) <= duration, making join matching a symmetric interval check centered on each event, not aligned to any fixed time grid

Source: entries/2026/05/29/topic-sliding-vs-tumbling-windows.md

join-window-and-aggregation-window-are-independent [IN] OBSERVATION

The join TimeWindow duration controls which event pairs can match, while the TumblingWindowAggregator window size controls result grouping; these are independent parameters that can differ without violating any invariant

Source: entries/2026/05/29/topic-sliding-vs-tumbling-windows.md

key-range-merkle-diff-returns-key-names [IN] OBSERVATION

KeyRangeMerkleTree.diff_keys() returns the string key names of divergent entries, translating numeric leaf indices back to keys

Source: entries/2026/05/29/merkle-tree-test_merkle.md

key-range-merkle-sorts-by-key [IN] OBSERVATION

KeyRangeMerkleTree sorts input pairs by key before building, so two trees with identical key-value content always produce the same root hash regardless of insertion order

Source: entries/2026/05/29/merkle-tree-merkle_tree.md

lamport-happens-before-returns-tristate [IN] OBSERVATION

happens_before(a, b, events) returns True (a causes b), False (b causes a), or None (concurrent) — modeling a strict partial order with explicit concurrency detection.

Source: entries/2026/05/29/lamport-clocks-test_lamport.md

lamport-happens-before-uses-graph-not-timestamps [IN] OBSERVATION

happensbefore() determines causality by BFS over parent/cause edges in the event DAG, not by comparing timestamps; the allevents parameter is accepted but never used.

Source: entries/2026/05/29/lamport-clocks-lamport.md

lamport-mutex-lowest-timestamp-priority [IN] OBSERVATION

LamportMutex grants critical section access to the requesting node with the lowest timestamp, with other requesters waiting until the holder releases.

Source: entries/2026/05/29/lamport-clocks-test_lamport.md

lamport-mutex-requires-all-acks [IN] OBSERVATION

canenter() returns True only when the node's request is at the queue head (lowest (timestamp, nodeid)) AND acknowledgments have been received from every other node in the system.

Source: entries/2026/05/29/lamport-clocks-lamport.md

lamport-receive-tick-guarantees-causal-order [IN] OBSERVATION

receive_tick computes max(local, received) + 1, ensuring every receive event has a timestamp strictly greater than the corresponding send event, which is the core Lamport clock invariant.

Source: entries/2026/05/29/lamport-clocks-lamport.md

lamport-receive-tick-max-merge [IN] OBSERVATION

LamportClock.receivetick(remotets) computes max(local, remote_ts) + 1, ensuring the clock advances past both the local and remote state on every receive.

Source: entries/2026/05/29/lamport-clocks-test_lamport.md

lamport-send-creates-receive [IN] OBSERVATION

Node.send_message(target, payload) has a side effect on the target node: it appends a RECEIVE event to the target's event log and advances the target's clock.

Source: entries/2026/05/29/lamport-clocks-test_lamport.md

lamport-send-delivers-synchronously [IN] OBSERVATION

sendmessage calls tonode.receive_message() directly in the same call stack, making message delivery instantaneous and deterministic with no async queue or network simulation.

Source: entries/2026/05/29/lamport-clocks-lamport.md

lamport-timestamp-order-not-implies-causality [IN] OBSERVATION

a.timestamp < b.timestamp does NOT imply a→b; two concurrent events can have ordered timestamps, which is a fundamental limitation of Lamport clocks that vector clocks address.

Source: entries/2026/05/29/lamport-clocks-lamport.md

lamport-total-order-breaks-ties-by-node-id [IN] OBSERVATION

totalorder() sorts by (timestamp, nodeid), using lexicographic node ID comparison as a deterministic tiebreaker when timestamps are equal.

Source: entries/2026/05/29/lamport-clocks-lamport.md

lamport-total-order-tiebreak-by-node-id [IN] OBSERVATION

totalorder() sorts events by (timestamp, nodeid), using the node identifier as a deterministic tiebreaker for concurrent events with equal timestamps.

Source: entries/2026/05/29/lamport-clocks-test_lamport.md

late-event-drop-is-hard-cutoff [IN] OBSERVATION

Events with timestamp < watermark - allowedlateness are unconditionally dropped and counted in stats.lateevents_dropped; there is no secondary path to recover or re-buffer them

Source: entries/2026/05/29/topic-watermark-vs-processing-time.md

lcs-compact-one-per-call [IN] OBSERVATION

lcscompact performs at most one merge operation per invocation and returns immediately; cascading compactions across levels require the caller to loop on run_compaction().

Source: entries/2026/05/29/sstable-and-compaction-sstable-lcscompact.md

lcs-l0-compacts-all [IN] OBSERVATION

L0 compaction always includes every L0 SSTable in the merge set (not just a subset), which can cause large write spikes when many L0 SSTables accumulate before the trigger fires.

Source: entries/2026/05/29/sstable-and-compaction-sstable-lcscompact.md

lcs-l0-triggers-on-count-not-size [IN] OBSERVATION

Level 0 in leveled compaction triggers based on SSTable count (l0compactiontrigger), not total size, because L0 SSTables can have overlapping key ranges unlike higher levels which maintain non-overlapping invariants

Source: entries/2026/05/29/topic-stcs-vs-lcs-tradeoffs.md

lcs-level-size-exponential [IN] OBSERVATION

Level size budgets grow as base_size × fanout^(level-1), so with defaults of 10MB base and 10× fanout: L1=10MB, L2=100MB, L3=1GB, etc.

Source: entries/2026/05/29/sstable-and-compaction-sstable-lcscompact.md

lcs-levels-grow-exponentially [IN] OBSERVATION

Leveled compaction sizes each level as levelbasesize * fanout^(level-1) with a 10MB default base and 7 max levels, so each level is 10x the previous by default

Source: entries/2026/05/29/topic-stcs-vs-lcs-tradeoffs.md

lcs-overlap-selection-is-quadratic [IN] OBSERVATION

During level N compaction, lcscompact calls _overlapping once per SSTable in the level to find the one with maximum overlap, then once more for the winner — making SSTable selection O(n*m) in level sizes.

Source: entries/2026/05/29/sstable-and-compaction-sstable-_overlapping.md

lcs-picks-max-overlap-sstable [IN] OBSERVATION

For level N→N+1 compaction, the SSTable with the most overlapping neighbors in the next level is chosen — the opposite of LevelDB/RocksDB's least-overlap heuristic, increasing write amplification per compaction.

Source: entries/2026/05/29/sstable-and-compaction-sstable-CompactionManager.md

lcs-produces-single-output-sstable [IN] OBSERVATION

Leveled compaction merges all inputs into one SSTable rather than splitting output by size, which means levels above L0 will contain overlapping key ranges after compaction — violating the real LCS invariant that each level has non-overlapping SSTables.

Source: entries/2026/05/29/sstable-and-compaction-sstable-CompactionManager.md

leaf-chain-correct-single-threaded [IN] OBSERVATION

The next_sibling leaf chain is correctly maintained through splits because the WAL logs the right page before the left page (ensuring the pointer target is durable before anything references it) and no concurrent reader can observe intermediate states

Source: entries/2026/05/29/topic-concurrency-and-latch-coupling.md

leaf-entry-cost-is-variable [IN] OBSERVATION

Each leaf entry costs 4 + len(keybytes) + len(valuebytes) bytes (2B key length + key + 2B value length + value), so maximum keys per page depends on actual data sizes, not just a fixed max_keys count.

Source: entries/2026/05/29/topic-page-overflow-and-split-mechanics.md

leaf-nodes-store-all-values [IN] OBSERVATION

Values are stored exclusively in leaf nodes; internal nodes contain only routing keys and child page pointers, making this a B+-tree not a B-tree

Source: entries/2026/05/29/topic-b-plus-tree-leaf-chains.md

leaf-serialized-overhead-is-7-bytes [IN] OBSERVATION

Every leaf page has a fixed 7-byte overhead (3B header via HEADERFMT = '>BH' + 4B nextsibling pointer) before any key-value entries, setting the usable capacity to page_size - 7 bytes.

Source: entries/2026/05/29/topic-page-overflow-and-split-mechanics.md

leaf-sibling-not-blink [IN] OBSERVATION

Leaf next_sibling pointers serve range scans only, not Lehman & Yao concurrent-split recovery; internal nodes have no right-links, making this a standard B+ tree, not a B-link tree

Source: entries/2026/05/29/topic-lehman-yao-1981.md

leaf-split-updates-sibling-pointers [IN] OBSERVATION

When a leaf splits, the new page inherits the old page's nextsibling pointer and the old page's nextsibling is updated to point to the new page — two field writes during an operation that already rewrites both pages

Source: entries/2026/05/29/topic-leaf-sibling-chains.md

leaf-wire-format-is-header-sibling-entries [IN] OBSERVATION

Leaf pages are laid out as [type:1B][numkeys:2B][nextsibling:4B][entries...] where each entry is [keylen:2B][key][vallen:2B][val], all big-endian

Source: entries/2026/05/28/b-tree-storage-engine-btree-serializeleaf.md

lehman-yao-needs-high-keys-and-internal-links [IN] OBSERVATION

The existing leaf next_sibling pointer is necessary but insufficient for Lehman-Yao concurrent access; the algorithm also requires right-links on internal nodes and a high key per node, neither of which this implementation has

Source: entries/2026/05/29/topic-concurrency-and-latch-coupling.md

leveled-compaction-promotes-to-l1 [IN] OBSERVATION

After leveled compaction runs on L0 SSTables, the output readers have level == 1 set by the caller.

Source: entries/2026/05/28/sstable-and-compaction-test_sstable.md

leveled-compaction-promotes-to-next-level [IN] OBSERVATION

After leveled compaction, resulting SSTables are assigned level = maxlevel + 1 where maxlevel is the highest level among inputs, establishing the level hierarchy (observable in test assertion at test_sstable.py:119).

Source: entries/2026/05/29/topic-leveled-compaction-write-amplification.md

leveled-triggers-on-l0-count [IN] OBSERVATION

Leveled compaction triggers when the L0 SSTable count meets the l0compactiontrigger parameter

Source: entries/2026/05/29/sstable-and-compaction-testertestsstable.md

linearizable-read-requires-consensus [IN] OBSERVATION

LinearizableRegister.read() broadcasts through the TOB consensus layer rather than reading local state, because a node cannot distinguish "I have the latest value" from "I'm partitioned and stale" without majority confirmation.

Source: entries/2026/05/29/topic-linearizable-reads-via-tob.md

live-projection-exactly-once-via-position-guard [IN] OBSERVATION

Live projections achieve exactly-once processing semantics through three reinforcing mechanisms: position-guarded deduplication prevents double-processing during overlapping catch-up and subscription, position advances even for events with no registered handler (preventing future re-evaluation), and the catch-up-then-subscribe pattern eliminates gaps between historical replay and live events.

live-projection-full-replay-on-restart [IN] OBSERVATION

A LiveProjection must replay all events from position 0 on initialization because there is no snapshot restore path wired into its construction

Source: entries/2026/05/29/topic-projection-snapshot-interaction.md

live-projection-idempotent-by-position [IN] OBSERVATION

LiveProjection.onevent guards against double-processing via eventid <= position, making it safe to overlap catch_up() and live subscription without duplicate handler invocations.

Source: entries/2026/05/29/event-sourcing-store-event_store-LiveProjection.md

live-projection-is-synchronous [IN] OBSERVATION

LiveProjection processes events synchronously inside the append() call path via the _subscribers mechanism, meaning the caller blocks until all live projections have updated

Source: entries/2026/05/29/event-sourcing-store-event_store.md

live-projection-no-error-boundary [IN] OBSERVATION

Handler exceptions in LiveProjection.onevent propagate through EventStore.append() to the caller; the projection's position is not updated, leaving it behind but recoverable via catchup().

Source: entries/2026/05/29/event-sourcing-store-event_store-LiveProjection.md

live-projection-no-snapshot-interval [IN] OBSERVATION

LiveProjection is never constructed with a snapshotinterval, so the automatic snapshot path in catchup() never triggers for live projections

Source: entries/2026/05/29/topic-projection-snapshot-interaction.md

live-projection-no-unsubscribe [IN] OBSERVATION

Once constructed, a LiveProjection cannot be unsubscribed from its store; the callback reference persists in _subscribers indefinitely with no removal mechanism.

Source: entries/2026/05/29/event-sourcing-store-event_store-LiveProjection.md

live-projection-position-advances-without-handler [IN] OBSERVATION

Events with no registered handler still advance LiveProjection.position, preventing re-evaluation on subsequent catchup() calls — position tracks last *seen* event, not last *handled* event.

Source: entries/2026/05/29/event-sourcing-store-event_store-LiveProjection.md

live-projection-subscribes-before-catchup [IN] OBSERVATION

LiveProjection._init registers the subscriber before the user calls catchup(), making duplicate processing possible but event loss impossible in the current single-threaded design

Source: entries/2026/05/29/topic-catch-up-subscription-gap.md

live-projection-uses-subscriber-list [IN] OBSERVATION

LiveProjection receives automatic updates by registering a callback in EventStore.subscribers (a list of Callable[[Event], None]]), which is invoked synchronously inside append() and appendbatch()

Source: entries/2026/05/29/topic-live-projection-subscription-mechanism.md

local-writes-self-immunize [IN] OBSERVATION

Both put and delete call recordseen with the local node's ID, ensuring the originating node drops its own changes when they loop back through the ring

Source: entries/2026/05/29/topic-ring-topology-propagation.md

lock-expiry-enables-reacquisition-without-release [IN] OBSERVATION

LockService.acquire checks isexpired(currenttime) before rejecting a competing acquire, so a crashed or GC-paused client's lock is automatically reclaimable after TTL without requiring explicit release

Source: entries/2026/05/29/topic-ddia-ch8-process-pauses.md

locks-block-future-transactions [IN] OBSERVATION

A participant in the "prepared" state holds key-level locks (self.locks[key] = tx_id) that cause any subsequent transaction touching the same keys to abort with a lock conflict during its own prepare().

Source: entries/2026/05/29/topic-2pc-blocking-problem.md

log-structured-bitcask-no-fsync [IN] OBSERVATION

log-structured-hash-table/bitcask.py uses only flush() without os.fsync() on data writes, unlike hash-index-storage which calls os.fsync per record when sync_writes is enabled

Source: entries/2026/05/29/topic-bitcask-paper-design.md

log-structured-compact-closes-active-file [IN] OBSERVATION

In log-structured-hash-table/bitcask.py, compaction closes activefile during its write phase, making all reads and writes to the active segment impossible until the active file is reopened under a new name at the end of compaction

Source: entries/2026/05/29/topic-concurrent-merge-safety.md

log-structured-hint-creation-decoupled [IN] OBSERVATION

The log-structured Bitcask exposes createhintfiles() as a standalone operation callable independently of compaction, while the hash-index variant only writes hint files inside compact()

Source: entries/2026/05/29/topic-bitcask-startup-cost.md

log-structured-hint-creation-is-manual [IN] OBSERVATION

In log-structured-hash-table/bitcask.py, hint file creation requires an explicit createhintfiles() call rather than being produced automatically as a side-effect of compaction

Source: entries/2026/05/29/topic-bitcask-paper-design.md

log-structured-hint-offset-32bit-cap [IN] OBSERVATION

Hint entry format uses !II (two u32 network-order integers), capping segment file offsets at 4 GiB regardless of maxsegmentsize configuration.

Source: entries/2026/05/29/log-structured-hash-table-bitcask-loadhint_file.md

log-structured-index-omits-size-and-timestamp [IN] OBSERVATION

The log-structured-hash-table Bitcask variant stores only (filepath, offset) in its index, omitting record size and timestamp, requiring an extra header parse on every get() to discover record boundaries

Source: entries/2026/05/29/topic-bitcask-paper-design.md

log-structured-sentinel-is-in-band [IN] OBSERVATION

log-structured-hash-table/bitcask.py uses TOMBSTONE = b"_BITCASKTOMBSTONE__" as an in-band sentinel in the value space; storing those exact bytes as a value causes silent data loss on recovery when the record is misinterpreted as a deletion

Source: entries/2026/05/29/topic-tombstone-semantics.md

lookup-cost-is-logarithmic-in-total-vnodes [IN] OBSERVATION

getnode performs a single bisect.bisect over ring_positions, so key lookup is O(log(N×V)) regardless of cluster size.

Source: entries/2026/05/29/topic-virtual-node-count-tuning.md

lsht-get-no-tombstone-guard [IN] OBSERVATION

BitcaskStore.get() in log-structured-hash-table/bitcask.py returns raw payload bytes without checking for the TOMBSTONE sentinel, so an index entry pointing to a tombstone record surfaces b"_BITCASKTOMBSTONE__" as a value

Source: entries/2026/05/29/topic-tombstone-handling-in-hint-files.md

lsht-hint-no-tombstone-flag [IN] OBSERVATION

The log-structured-hash-table hint entry format (!II = keysize + offset) has no field to distinguish live records from tombstones, making loadhintfile structurally unable to replicate scansegment's deletion logic

Source: entries/2026/05/29/topic-tombstone-handling-in-hint-files.md

lsm-and-sstable-have-no-checksums [IN] OBSERVATION

Neither log-structured-merge-tree/lsm.py nor sstable-and-compaction/sstable.py compute or verify any checksums; a single bit-flip in a length-prefix field causes cascading misframing of all subsequent records

Source: entries/2026/05/29/topic-crc-coverage-audit.md

lsm-compact-no-atomic-rename [IN] OBSERVATION

The LSM tree's compact() method does not use os.rename or os.replace; grep for atomic rename operations returns zero matches in the LSM module, meaning the compaction output is written directly to its final path with no atomic swap.

Source: entries/2026/05/29/topic-crash-recovery-testing.md

lsm-compact-removes-tombstones-safely [IN] OBSERVATION

The LSM Tree's compact() method removes tombstones because it performs a full merge of all SSTables, guaranteeing no surviving SSTable can contain a superseded live value

Source: entries/2026/05/28/topic-tombstone-gc-safety.md

lsm-compaction-blocks-callers [IN] OBSERVATION

LSMTree.compact() runs synchronously in the caller's thread, meaning a put() or delete() that triggers the SSTable count threshold will block until the full k-way merge completes.

Source: entries/2026/05/29/topic-bitcask-merge-window-scheduling.md

lsm-compaction-deletes-without-fsync-barrier [IN] OBSERVATION

The LSM tree's compact() calls os.remove(sst.path) without an explicit fsync on the new SSTable or its parent directory beforehand, meaning the new file's data may not be durable when the old file is unlinked

Source: entries/2026/05/28/topic-bitcask-crash-recovery.md

lsm-compaction-duplicates-safe [IN] OBSERVATION

A crash during LSM compaction produces duplicate entries (old and merged SSTables both present) rather than data loss, because newer SSTables take read precedence.

Source: entries/2026/05/28/topic-bitcask-compaction-atomicity.md

lsm-compaction-is-full-merge [IN] OBSERVATION

compact() merges all SSTables into a single new SSTable (size-tiered, single-level), not incremental or leveled — simpler but with higher space amplification.

Source: entries/2026/05/28/log-structured-merge-tree-lsm.md

lsm-compaction-last-writer-wins [IN] OBSERVATION

Both LSM implementations resolve key conflicts during compaction by keeping only the newest value; older values are unconditionally discarded with no user-defined merge logic

Source: entries/2026/05/29/topic-rocksdb-merge-operator.md

lsm-crash-recovery-impossible-without-manifest [IN] OBSERVATION

LSM crash recovery is structurally impossible: the set of live SSTable files exists only in memory with no persistent manifest, so a crash permanently loses the SSTable list, and the WAL that could help reconstruct state has no integrity checking to validate its own records during replay.

lsm-crash-test-ignores-compaction [IN] OBSERVATION

testcrashrecovery in test_lsm.py only covers WAL replay for unflushed memtable entries; no test exercises crashes during SSTable compaction, leaving the most dangerous data-loss window (mid-compaction file deletion) untested.

Source: entries/2026/05/29/topic-crash-recovery-testing.md

lsm-flat-compaction-always-bottommost [IN] OBSERVATION

LSMTree.compact() in lsm.py merges all SSTables into one without level hierarchy, making every compaction implicitly a bottommost-level operation where tombstone removal is always safe

Source: entries/2026/05/29/topic-rocksdb-bottommost-compaction.md

lsm-flush-double-loss-window [IN] OBSERVATION

The LSM _flush() truncates the WAL after writing an SSTable, but since neither the SSTable write nor the truncation is fsynced, a crash can lose both the WAL source data and the SSTable destination simultaneously — making the data irrecoverable

Source: entries/2026/05/29/topic-bitcask-crash-recovery-guarantees.md

lsm-forward-only-iteration [IN] OBSERVATION

SSTable iterators (scan, scan_all) support only forward iteration; there is no Prev() or reverse scan capability, preventing ORDER BY DESC or backward cursor pagination

Source: entries/2026/05/29/topic-leveldb-merging-iterator.md

lsm-get-probes-all-sstables-on-miss [IN] OBSERVATION

LSMTree.get() iterates through every SSTable in reverse sequence order and returns only after checking all of them when a key is absent; without bloom filters, missing-key lookups are O(N) in SSTable count

Source: entries/2026/05/29/topic-bloom-filters-for-read-optimization.md

lsm-heapq-imported-unused [IN] OBSERVATION

heapq is imported but never referenced in the code; compact() uses list sorting (sort by (key, -seq)) instead of a heap-based k-way merge.

Source: entries/2026/05/28/log-structured-merge-tree-lsm.md

lsm-memtable-swap-is-reference-not-copy [IN] OBSERVATION

flush line 307 frozen = self.memtable captures a Python reference to the SortedDict, not a deep copy, so concurrent mutation of the dict after the swap would be unsafe without synchronization

Source: entries/2026/05/29/topic-concurrent-flush-safety.md

lsm-memtable-threshold-triggers-flush [IN] OBSERVATION

When the number of entries in the active memtable reaches memtable_threshold, the memtable is automatically frozen and flushed to a new SSTable on disk

Source: entries/2026/05/28/log-structured-merge-tree-test_lsm.md

lsm-miss-probes-all-due-to-no-bloom-integration [IN] OBSERVATION

The LSM tree scans every SSTable on negative lookups because the Bloom filter module — which is correctly implemented with textbook-optimal sizing — is never wired into the read path.

lsm-newest-first-read-path [IN] OBSERVATION

get() searches memtable, then immutable memtables, then SSTables in reverse-seq order (newest first), returning the first match; this guarantees newer writes shadow older ones without scanning all levels.

Source: entries/2026/05/28/log-structured-merge-tree-lsm.md

lsm-no-concurrency-control [IN] OBSERVATION

The LSM tree in lsm.py has no locking, reference counting, or Version snapshots; compact() mutates self._sstables in place, so concurrent readers could see inconsistent state or reference deleted SSTables

Source: entries/2026/05/29/topic-mvcc-version-snapshots.md

lsm-no-manifest [IN] OBSERVATION

The LSM tree has no persistent metadata log (MANIFEST); the set of live SSTable files exists only in memory (self._sstables) and is not crash-recoverable independently of directory scanning

Source: entries/2026/05/28/topic-manifest-based-compaction.md

lsm-no-synchronization [IN] OBSERVATION

LSMTree in lsm.py has zero locking, atomic swaps, or synchronization primitives; all shared state (memtable, sstables) is mutated in-place without protection

Source: entries/2026/05/28/topic-lsm-concurrency-safety.md

lsm-no-write-counters [IN] OBSERVATION

The LSM-tree engine has no built-in byte or operation counters for writes; all write instrumentation (WAL appends, SSTable flushes, compaction output) must be added from scratch

Source: entries/2026/05/29/topic-write-amplification-measurement.md

lsm-range-scan-materializes-all [IN] OBSERVATION

lsm.py:range_scan loads all matching entries from every SSTable and memtable into a dict before returning any results, making first-result latency proportional to total result set size

Source: entries/2026/05/28/topic-merge-iterator-vs-dict-merge.md

lsm-replay-strips-seq-nums [IN] OBSERVATION

The LSM tree's replay method (lsm.py:28) returns List[Tuple[str, bytes]], discarding WAL sequence numbers and making gap detection impossible at the LSM layer

Source: entries/2026/05/29/topic-gap-detection-during-replay.md

lsm-requires-wal-hash-does-not [IN] OBSERVATION

The LSM tree uses a separate WAL (lsm.py:14-64) for crash recovery because its memtable is volatile; the hash index writes directly to the append-only data file, which serves as its own recovery log and needs no WAL.

Source: entries/2026/05/29/topic-ddia-chapter3-hash-indexes.md

lsm-sparse-index-default-16 [IN] OBSERVATION

The SSTable sparse index samples every 16th key by default; a point lookup binary-searches the sparse index then linear-scans up to 16 entries within the candidate block.

Source: entries/2026/05/28/log-structured-merge-tree-lsm.md

lsm-sstables-unprotected-mutation [IN] OBSERVATION

LSMTree.sstables is mutated by both flush() (append) and compact() (full replacement) with no synchronization, versioning, or ref counting, making concurrent reads unsafe if threading or async is added

Source: entries/2026/05/29/topic-superversion-refcount-implementation.md

lsm-tests-single-threaded-only [IN] OBSERVATION

The LSM test suite (testlsm.py) runs all operations sequentially with no concurrency, so race conditions between flush(), compact(), and range_scan() are never exercised

Source: entries/2026/05/28/topic-lsm-concurrency-safety.md

lsm-tombstone-is-empty-bytes [IN] OBSERVATION

Deletion is represented as TOMBSTONE = b"" (empty bytes), which means empty-byte-string values and deleted keys are indistinguishable at the storage layer.

Source: entries/2026/05/28/log-structured-merge-tree-lsm.md

lsm-tree-has-no-level-concept [IN] OBSERVATION

LSMTree in lsm.py maintains a flat list of SSTables ordered by sequence number with no level assignment; compaction merges all files into one rather than promoting between levels

Source: entries/2026/05/29/topic-leveldb-version-set.md

lsm-two-merge-strategies [IN] OBSERVATION

Compaction uses heapq-based k-way merge with prevkey deduplication (lsm.py ~line 323), while rangescan uses a dict-based materialize-everything approach (lines 275–298) — two fundamentally different merge strategies in the same codebase

Source: entries/2026/05/29/topic-leveldb-merging-iterator.md

lsm-uses-sortedcontainers [IN] OBSERVATION

The LSM memtable uses sortedcontainers.SortedDict — the only external dependency across the four storage engine modules examined so far.

Source: entries/2026/05/28/log-structured-merge-tree-lsm.md

lsm-wal-before-memtable [IN] OBSERVATION

Every put() and delete() writes to the WAL before inserting into the memtable, ensuring no acknowledged write is lost on crash.

Source: entries/2026/05/28/log-structured-merge-tree-lsm.md

lsm-wal-crash-tolerant-replay [IN] OBSERVATION

LSM WAL replay() silently discards any trailing partial record by checking remaining bytes at every parse step, making it tolerant of process crashes during append()

Source: entries/2026/05/29/log-structured-merge-tree-lsm-WAL.md

lsm-wal-entry-size [IN] OBSERVATION

Each LSM WAL entry is exactly 8 + len(key_utf8) + len(value) bytes: two 4-byte big-endian length headers plus the raw payloads, with no checksum or operation-type field

Source: entries/2026/05/29/topic-write-amplification-measurement.md

lsm-wal-has-no-checksums [IN] OBSERVATION

The LSM tree's WAL (lsm.py:13-63) uses length-prefixed records with no CRC, relying solely on short-read detection for crash recovery; distinct from the separate lsm-wal-has-no-fsync issue

Source: entries/2026/05/29/topic-wal-corruption-models.md

lsm-wal-has-no-commit-markers [IN] OBSERVATION

The LSM WAL (lsm.py:13-65) appends bare key-value pairs with no commit or transaction boundaries, safe only because structural changes (compaction, SSTable creation) use atomic file operations outside the WAL

Source: entries/2026/05/29/topic-wal-operation-boundaries.md

lsm-wal-has-no-fsync [IN] OBSERVATION

The WAL class in log-structured-merge-tree/lsm.py calls flush() but never os.fsync(), making it strictly weaker than the standalone WAL which offers configurable sync modes

Source: entries/2026/05/28/topic-directory-fsync-semantics.md

lsm-wal-has-no-integrity-check [IN] OBSERVATION

The WAL in log-structured-merge-tree/lsm.py uses only length-prefix framing with no CRC or checksum; a single bit flip in the length field causes silent data corruption or misaligned reads

Source: entries/2026/05/29/topic-block-aligned-wal-records.md

lsm-wal-no-checksums [IN] OBSERVATION

The LSM WAL format has no checksums or magic bytes; it can detect truncation (incomplete length-prefixed records) but not corruption of existing bytes

Source: entries/2026/05/29/log-structured-merge-tree-lsm-WAL.md

lsm-wal-no-crc [IN] OBSERVATION

The LSM tree's WAL (lsm.py:14-53) uses only length-prefixed framing with no checksum, making it unable to distinguish corruption from valid data unless a length field causes an out-of-bounds read

Source: entries/2026/05/29/topic-record-boundary-recovery.md

lsm-wal-replays-on-reopen [IN] OBSERVATION

Constructing a new LSMTree on an existing directory replays the WAL to recover unflushed memtable state; this is validated by the crash recovery test which skips close() and reopens

Source: entries/2026/05/28/log-structured-merge-tree-test_lsm.md

lsm-wal-strictly-weaker-than-standalone-wal [IN] OBSERVATION

The LSM tree's built-in WAL provides strictly weaker guarantees than the standalone WAL module along two independent axes: it has no CRC or checksum for integrity (vs per-record CRC32 in the standalone WAL), and its truncation zeroes the entire file instantly via wb mode (vs careful record-by-record rewriting), meaning corruption goes undetected and truncation is all-or-nothing rather than surgical.

lsm-wal-truncate-destroys-immediately [IN] OBSERVATION

The LSM tree's WAL.truncate() opens the file with "wb" mode which zeroes all content instantly, with no intermediate durable state to recover from if a crash occurs before the file is reopened for appending

Source: entries/2026/05/29/topic-crash-safety-of-truncate.md

lsm-wal-truncate-no-arg [IN] OBSERVATION

The LSM tree at lsm.py:314 calls self._wal.truncate() with no sequence argument, indicating it uses a different WAL interface than the standalone write-ahead-log/wal.py module

Source: entries/2026/05/29/topic-log-structured-checkpoint-coordination.md

lsm-wal-truncate-no-error-recovery [IN] OBSERVATION

If any open() or close() call within WAL.truncate() fails, the exception propagates uncaught and can leave self._fd holding a closed handle, breaking subsequent append() calls

Source: entries/2026/05/29/log-structured-merge-tree-lsm-truncate.md

lsm-wal-truncate-wb-then-ab [IN] OBSERVATION

WAL.truncate() uses a two-open sequence — open in "wb" mode to truncate the file, then reopen in "ab" mode for appending — because "wb" positions the cursor at offset 0 which is unsafe for a WAL

Source: entries/2026/05/29/log-structured-merge-tree-lsm-truncate.md

lsm-wal-truncated-after-sstable-write [IN] OBSERVATION

The WAL is truncated only after a successful SSTable write; if a crash occurs between SSTable.write and WAL.truncate, replay harmlessly re-inserts into the memtable (idempotent because it's a dict).

Source: entries/2026/05/28/log-structured-merge-tree-lsm.md

lww-auto-increments-timestamp [IN] OBSERVATION

Sequential LWWRegister.set() calls without explicit timestamps produce monotonically increasing timestamps (auto-incremented), so ordering is preserved even without caller-supplied clocks.

Source: entries/2026/05/29/conflict-free-replicated-data-types-test_crdts.md

lww-register-tiebreaks-on-replica-id [IN] OBSERVATION

When two LWWRegister writes have identical timestamps, the one with the lexicographically higher replica_id wins, making conflict resolution deterministic but arbitrary

Source: entries/2026/05/29/topic-delta-state-crdts.md

lww-tiebreak-is-deterministic [IN] OBSERVATION

LWWRegister.merge uses (timestamp, writer_id) tuple comparison, making conflict resolution a total order with no ambiguity since replica IDs are unique

Source: entries/2026/05/29/conflict-free-replicated-data-types-crdts.md

macos-fsync-incomplete-without-fullfsync [IN] OBSERVATION

On macOS/Darwin, os.fsync() does not guarantee flushing the disk write cache to stable storage; only fcntl(fd, F_FULLFSYNC) provides that guarantee, meaning all 13 fsync sites in the codebase may provide no power-loss durability on the development platform

Source: entries/2026/05/29/topic-sqlite-durability-model.md

macos-fsync-not-durable [IN] OBSERVATION

On the development platform (Darwin/APFS), os.fsync() may not flush the disk write cache; true durability requires fcntl(fd, F_FULLFSYNC), which is never used at any of the 13 fsync call sites in the codebase.

Source: entries/2026/05/29/topic-posix-durability-guarantees.md

macos-fsync-not-durable-without-fullfsync [IN] OBSERVATION

On macOS (the development platform), os.fsync() may not flush the disk's hardware write cache; true power-loss durability requires fcntl(fd, F_FULLFSYNC) which is never used in the codebase.

Source: entries/2026/05/29/topic-fsync-ordering-guarantees.md

macos-fsync-weaker-than-implied [IN] OBSERVATION

The codebase uses os.fsync() exclusively, which on macOS/APFS does not flush the disk write cache; true durability requires fcntl(fd, F_FULLFSYNC) which is never used anywhere

Source: entries/2026/05/29/topic-shadow-paging-or-steal-no-force.md

map-side-join-cartesian-on-duplicates [IN] OBSERVATION

When multiple records share the same join key value, all three strategies produce the cartesian product of the matching left and right groups

Source: entries/2026/05/29/map-side-join-mapsidejoins.md

map-side-join-conflict-prefix [IN] OBSERVATION

When left and right datasets share a non-key field name, mergerecords disambiguates with left and right prefixes; this convention applies in all join types including None-fill paths for left joins

Source: entries/2026/05/29/map-side-join-mapsidejoins.md

map-side-join-missing-key-skip [IN] OBSERVATION

Records missing the join key are silently dropped and counted in stats["skipped_records"] rather than raising exceptions, across all three join strategies

Source: entries/2026/05/29/map-side-join-mapsidejoins.md

map-side-join-no-real-parallelism [IN] OBSERVATION

Mapper parallelism is simulated via round-robin chunking and mapperid tagging on output records; no threads or processes are used

Source: entries/2026/05/29/map-side-join-mapsidejoins.md

map-side-join-three-strategies [IN] OBSERVATION

The module implements exactly three join strategies (broadcast hash, partitioned hash, sort-merge) that all produce identical inner-join results when verified by comparejoinstrategies

Source: entries/2026/05/29/map-side-join-mapsidejoins.md

map-side-joins-track-mapper-id-on-output [IN] OBSERVATION

All three map-side join strategies attach a mapperid field to each output record, enabling verification that partitions/mappers operated independently

Source: entries/2026/05/29/topic-ddia-ch10-reduce-side-joins.md

mapreduce-combiner-transparent [IN] OBSERVATION

Applying a combiner never changes final results; it only reduces intermediate record volume — the combiner must be semantically invisible

Source: entries/2026/05/29/mapreduce-framework-test_mapreduce.md

mapreduce-materializes-at-phase-boundaries [IN] OBSERVATION

MapReduce in mapreduce.py writes intermediate data to JSON partition files between map and reduce phases, creating a full materialization barrier that prevents streaming from mapper to reducer

Source: entries/2026/05/29/topic-ddia-chapter-10-batch-processing.md

mapreduce-results-deterministic [IN] OBSERVATION

MapReduceJob.run() produces identical output regardless of nummappers/numreducers configuration for the same input data — the shuffle/partition step is deterministic

Source: entries/2026/05/29/mapreduce-framework-test_mapreduce.md

mapreduce-run-accepts-filepath [IN] OBSERVATION

MapReduceJob.run() accepts either a list of (key, value) tuples or a file path string as input, with the file path handled transparently

Source: entries/2026/05/29/mapreduce-framework-test_mapreduce.md

mapreduce-stats-tracks-records [IN] OBSERVATION

job.stats tracks mapinputrecords, mapoutputrecords, reduceoutputrecords, worker counts, and elapsed time after each run

Source: entries/2026/05/29/mapreduce-framework-test_mapreduce.md

mapreduce-strict-mode-default [IN] OBSERVATION

MapReduceJob defaults to strict mode where mapper/reducer exceptions propagate to the caller; fault_tolerant=True must be explicitly set to enable silent error skipping

Source: entries/2026/05/29/mapreduce-framework-test_mapreduce.md

materialization-barriers-span-storage-and-processing [IN] OBSERVATION

Unbounded in-memory materialization is a cross-cutting concern from storage through batch processing: storage operations (LSM range scans, compaction, SSTable index rebuilds) materialize entire datasets, and pipeline stages (Count) silently accumulate all input before producing output, meaning the materialization problem exists at every layer of the data processing stack.

max-keys-prevents-overflow [IN] OBSERVATION

The maxkeysperpage parameter bounds entries per node so that serialized output fits within pagesize, with splits triggered before the limit is exceeded

Source: entries/2026/05/29/topic-page-overflow-and-size-limits.md

maybe-rotate-requires-lock [IN] OBSERVATION

mayberotate must be called under self.lock but does not acquire or assert the lock itself; all three call sites (append, appendbatch, checkpoint) satisfy this obligation.

Source: entries/2026/05/29/write-ahead-log-wal-mayberotate.md

membership-correctness-masks-hint-data-loss [IN] OBSERVATION

The gossip membership lifecycle correctly detects and removes crashed hint-holding nodes, but this membership-layer correctness masks data-layer failure: hints stored on the crashed node are permanently lost with no recovery mechanism or error indication, and the cluster reports the failure as handled because the membership event was processed successfully.

membership-detection-unreliable-in-accuracy-and-propagation [IN] OBSERVATION

The membership subsystem is unreliable at two independent levels that interact destructively: failure detection is both the single correctness bottleneck AND permanently miscalibrated (no arrival-rate history for adaptive thresholds), while membership and data convergence operate at fundamentally different rates (O(log N) membership gossip vs O(N) data propagation in ring topology), meaning incorrect liveness decisions propagate through membership faster than data can adjust to them.

memtable-threshold-bounds-recovery [IN] OBSERVATION

The memtable_threshold parameter (lsm.py:202) directly caps the maximum number of WAL entries that must be replayed on crash recovery, because the WAL is truncated on every flush.

Source: entries/2026/05/29/topic-wal-checkpoint-protocol.md

merge-does-not-delete-inputs [IN] OBSERVATION

merge_sstables never deletes or modifies input SSTable files; cleanup is the caller's responsibility, and the current CompactionManager callers also do not delete old files from disk.

Source: entries/2026/05/28/sstable-and-compaction-sstable-merge_sstables.md

merge-mutates-self-and-returns-self [IN] OBSERVATION

Every CRDT merge() method mutates the receiver in place and returns self, enabling chaining but meaning callers must deepcopy() before merging if the original state must be preserved

Source: entries/2026/05/29/topic-delta-state-crdts.md

merge-no-atomic-write [IN] OBSERVATION

mergesstables writes directly to outputpath with no temp-file-and-rename pattern, so a crash mid-merge leaves a corrupt partial file at the target path.

Source: entries/2026/05/28/sstable-and-compaction-sstable-merge_sstables.md

merge-preserves-tombstones-by-default [IN] OBSERVATION

mergesstables writes tombstones (value=None) to the output unless removetombstones=True is explicitly passed, and no current call site in CompactionManager enables removal.

Source: entries/2026/05/28/sstable-and-compaction-sstable-merge_sstables.md

merge-sstables-tombstone-flag-is-caller-controlled [IN] OBSERVATION

mergesstables() accepts a removetombstones boolean from the caller rather than computing safety from level metadata, so correctness depends entirely on the caller passing the right value

Source: entries/2026/05/29/topic-leveled-compaction-tombstone-policy.md

merkle-diff-asymmetric-leaf-counts [IN] OBSERVATION

Two trees with different leafcount but the same paddedsize can be diffed; positions where one tree has data and the other has EMPTY_HASH padding are reported as diffs

Source: entries/2026/05/29/merkle-tree-merkle_tree-diff.md

merkle-diff-enables-efficient-sync [IN] OBSERVATION

MerkleTree.diff compares two trees in O(log n) by short-circuiting at matching subtree hashes, providing the divergence-detection layer that would feed receivereplica with only changed keys rather than a full key scan

Source: entries/2026/05/29/topic-dynamo-anti-entropy.md

merkle-diff-enables-targeted-key-range-repair [IN] OBSERVATION

Merkle tree diffs produce sorted leaf indices that map directly to key ranges when the tree is built with key-sorted input, enabling anti-entropy repair that transfers only divergent key ranges rather than full dataset comparisons.

merkle-diff-filters-padding [IN] OBSERVATION

diff() excludes padding indices beyond max(self.leafcount, other.leafcount), using max (not min) so positions where one tree has data and the other has EMPTY_HASH padding are still reported as diffs

Source: entries/2026/05/29/merkle-tree-merkle_tree-diff.md

merkle-diff-is-pure [IN] OBSERVATION

diff() and diffrecursive are read-only — they mutate neither tree, allocate only a local result list, and produce no side effects

Source: entries/2026/05/29/merkle-tree-merkle_tree-diff.md

merkle-diff-order-is-ascending [IN] OBSERVATION

Differing leaf indices returned by diff() are in ascending order as a consequence of left-to-right DFS traversal in diffrecursive

Source: entries/2026/05/29/merkle-tree-merkle_tree-diff.md

merkle-diff-prunes-matching-subtrees [IN] OBSERVATION

MerkleTree.diffrecursive() returns immediately when subtree hashes match, making the diff cost proportional to the number of divergent keys rather than total tree size — the logarithmic efficiency that makes anti-entropy practical.

Source: entries/2026/05/29/topic-anti-entropy-vs-read-repair.md

merkle-diff-returns-leaf-indices [IN] OBSERVATION

MerkleTree.diff() returns a list of integer leaf indices where the two trees diverge, not subtree nodes or hash pairs

Source: entries/2026/05/29/merkle-tree-test_merkle.md

merkle-get-proof-rejects-padding-indices [IN] OBSERVATION

getproof bounds-checks against leafcount (real leaves), not padded_size, so proofs cannot be generated for padding slots even though they have valid hashes in the backing array

Source: entries/2026/05/29/merkle-tree-merkletree-getproof.md

merkle-hashes-over-hex-not-bytes [IN] OBSERVATION

Internal node hashes are computed over concatenated 64-char hex digest strings (128 ASCII bytes) rather than raw 32-byte SHA-256 digests, resulting in 4x more data per hash input.

Source: entries/2026/05/29/merkle-tree-merkletree-verifyproof.md

merkle-proof-direction-is-sibling-position [IN] OBSERVATION

The direction field in each proof sibling tuple indicates where the sibling sits (left or right), not the current node — this controls hash concatenation order during verification

Source: entries/2026/05/29/merkle-tree-merkletree-getproof.md

merkle-proof-is-snapshot [IN] OBSERVATION

Merkle proofs are point-in-time snapshots; a proof generated before update_leaf will fail verification against the new root hash, and nothing invalidates the stale proof in-place

Source: entries/2026/05/29/merkle-tree-merkletree-getproof.md

merkle-proof-length-equals-height [IN] OBSERVATION

A proof for a tree with paddedsize = 2^h contains exactly h sibling entries, one per tree level excluding the root

Source: entries/2026/05/29/merkle-tree-merkletree-getproof.md

merkle-proof-sibling-order-bottom-up [IN] OBSERVATION

getproof returns siblings ordered from leaf level to root level (bottom-up), and verifyproof consumes them in that same order

Source: entries/2026/05/29/merkle-tree-merkletree-getproof.md

merkle-proof-size-is-log-n [IN] OBSERVATION

A Merkle proof contains exactly height sibling hashes where height = log2(nextpowerof2(leafcount)), making both proof size and verification cost O(log N).

Source: entries/2026/05/29/topic-merkle-proof-security-model.md

merkle-tree-array-indexing [IN] OBSERVATION

The tree uses 0-indexed implicit array layout where children of node i are at 2i+1 and 2i+2, and leaves start at index padded_size - 1

Source: entries/2026/05/29/merkle-tree-merkle_tree.md

merkle-tree-hashes-raw-bytes [IN] OBSERVATION

The Merkle tree implementation avoids serialization canonicalization entirely by accepting caller-provided bytes and hashing them directly via hashlib.sha256(data), making it immune to the JSON encoding ambiguity that affects the PBFT digest.

Source: entries/2026/05/29/topic-json-canonicalization-risks.md

merkle-tree-height-formula [IN] OBSERVATION

A MerkleTree with 4 leaves has height == 2, indicating height counts internal levels (log₂ of padded leaf count)

Source: entries/2026/05/29/merkle-tree-test_merkle.md

merkle-tree-supports-non-power-of-two [IN] OBSERVATION

The implementation handles non-power-of-2 leaf counts (tested with 1 and 3 leaves) via internal padding to the next power of 2 using EMPTY_HASH

Source: entries/2026/05/29/merkle-tree-test_merkle.md

merkle-tree-uses-sha256-for-cross-replica-verification [IN] OBSERVATION

The Merkle tree uses SHA-256 (merkletree.py:11) for tamper-evident cross-replica verification via verifyproof, while all storage engines use CRC32 — reflecting the split between accidental-corruption detection on trusted local disk and adversarial-integrity across trust boundaries

Source: entries/2026/05/29/topic-crc32-vs-cryptographic-hash.md

meta-page-fixed-20-byte-payload [IN] OBSERVATION

The metadata payload is exactly 20 bytes (5 big-endian uint32s packed as '>5I'), zero-padded to pagesize; the format requires pagesize >= 20 but this is not enforced.

Source: entries/2026/05/29/b-tree-storage-engine-btree-writemeta.md

meta-page-is-page-zero [IN] OBSERVATION

The B-tree metadata (root, height, totalkeys, nextfree, freehead) is always stored at file offset 0 in page 0, which is reserved as the superblock and occupies exactly pagesize bytes.

Source: entries/2026/05/29/b-tree-storage-engine-btree-writemeta.md

more-vnodes-monotonically-improves-balance [IN] OBSERVATION

The test testvnodecountaffectsbalance asserts that imbalance at 500 vnodes is strictly less than at 1 vnode, and the statistical model (variance ∝ 1/V) guarantees monotonic improvement.

Source: entries/2026/05/29/topic-virtual-node-count-tuning.md

mr-combiner-is-map-side-only [IN] OBSERVATION

The combiner runs inside runmapper on each mapper's output for a single partition independently; it never sees data from other mappers or other partitions.

Source: entries/2026/05/29/mapreduce-framework-mapreduce.md

mr-fault-tolerant-silently-drops [IN] OBSERVATION

When fault_tolerant=True, mapper and reducer exceptions are caught and silently skipped with no logging or error reporting, producing partial results without any indication of data loss.

Source: entries/2026/05/29/mapreduce-framework-mapreduce.md

mr-groupby-requires-presort [IN] OBSERVATION

runreducer sorts all intermediate pairs by key before calling itertools.groupby, which is required because groupby only groups consecutive elements with the same key.

Source: entries/2026/05/29/mapreduce-framework-mapreduce.md

mr-hash-partitions-keys [IN] OBSERVATION

MapReduceJob assigns intermediate keys to reducer partitions using hash(key) % num_reducers, ensuring all values for a given key reach the same reducer.

Source: entries/2026/05/29/mapreduce-framework-mapreduce.md

mr-intermediate-data-uses-json-files [IN] OBSERVATION

Intermediate shuffle data is serialized as JSON files in a temp directory with naming convention map-{mapper_id}-part-{partition}.json.

Source: entries/2026/05/29/mapreduce-framework-mapreduce.md

multi-leader-conflict-log-records-strategy [IN] OBSERVATION

Conflicts are recorded in conflict_log with metadata including the key and the ConflictStrategy enum value that resolved them.

Source: entries/2026/05/29/multi-leader-replication-testmultileader.md

multi-leader-conflict-record-captures-both-values [IN] OBSERVATION

ConflictRecord stores both localvalue and remotevalue along with the key and resolution strategy, providing a complete audit trail for every conflict resolution.

Source: entries/2026/05/29/multi-leader-replication-testertestmulti_leader.md

multi-leader-convergence-skips-origin [IN] OBSERVATION

allconverged() compares (value, timestamp, istombstone) but intentionally ignores originnodeid (tuple index 2), since different arrival orders can record different origins for the same resolved state

Source: entries/2026/05/29/multi-leader-replication-multi_leader-MultiLeaderCluster.md

multi-leader-custom-merge-idempotent-across-syncs [IN] OBSERVATION

Custom merge functions are applied once per conflict; repeated sync rounds on already-converged values must not re-trigger the merge, preventing bugs like counters doubling on every sync cycle.

Source: entries/2026/05/29/multi-leader-replication-testertestmulti_leader.md

multi-leader-custom-merge-new-timestamp [IN] OBSERVATION

Custom merge resolution creates a new timestamp (max(localts, remotets) + 1) and canonical origin so the merged result supersedes both conflicting inputs in any future LWW comparison

Source: entries/2026/05/29/multi-leader-replication-multi_leader.md

multi-leader-custom-merge-requires-merge-fn [IN] OBSERVATION

Constructing a cluster with CUSTOMMERGE strategy and mergefn=None is accepted silently, but raises TypeError at the first actual conflict during sync()

Source: entries/2026/05/29/multi-leader-replication-multi_leader-MultiLeaderCluster.md

multi-leader-idempotent-apply [IN] OBSERVATION

Each (key, timestamp, originnode) triple is applied at most once per node, tracked by the seen set, preventing duplicate application in ring topologies where changes propagate through multiple hops

Source: entries/2026/05/29/multi-leader-replication-multi_leader.md

multi-leader-lamport-monotonic [IN] OBSERVATION

Successive put() calls on the same ReplicaNode yield strictly increasing Lamport timestamps.

Source: entries/2026/05/29/multi-leader-replication-testmultileader.md

multi-leader-lww-deterministic-tiebreak [IN] OBSERVATION

LWW conflict resolution compares (timestamp, node_id) tuples using Python's lexicographic tuple comparison, guaranteeing all nodes independently reach the same winner without coordination

Source: entries/2026/05/29/multi-leader-replication-multi_leader.md

multi-leader-no-tombstone-gc [IN] OBSERVATION

Multi-leader replication stores tombstones indefinitely with no compaction or garbage collection method, which is safe but accumulates dead entries without bound

Source: entries/2026/05/29/topic-sstable-tombstone-gc-safety.md

multi-leader-ring-convergence-rounds [IN] OBSERVATION

RING topology requires at least N−1 sync() rounds to fully propagate a single-source change across N nodes, because each round advances the change by exactly one hop

Source: entries/2026/05/29/multi-leader-replication-multi_leader-MultiLeaderCluster.md

multi-leader-ring-needs-multiple-rounds [IN] OBSERVATION

Ring topology requires at least 2 sync rounds to propagate a write across 3+ nodes, unlike all-to-all topology which converges in a single round; syncuntilconverged() loops to handle this.

Source: entries/2026/05/29/multi-leader-replication-testertestmulti_leader.md

multi-leader-sync-collects-then-distributes [IN] OBSERVATION

sync() drains all pending queues from every node before distributing any changes, preventing intra-round cascading where one node's change would trigger further propagation within the same sync round

Source: entries/2026/05/29/multi-leader-replication-multi_leader.md

multi-leader-sync-count-includes-noops [IN] OBSERVATION

The count returned by sync() includes idempotent no-op deliveries (changes already in the target's _seen set), so it reflects replication attempts, not accepted changes

Source: entries/2026/05/29/multi-leader-replication-multi_leader-MultiLeaderCluster.md

multi-leader-sync-designed-for-safe-convergence [IN] OBSERVATION

Multi-leader sync is designed for safe convergence: the two-phase collect-then-distribute pattern prevents intra-round cascading, custom merge functions are required to be idempotent across repeated syncs, and merged results receive fresh timestamps ensuring they supersede both inputs.

multi-leader-sync-is-discrete [IN] OBSERVATION

Replication occurs only when sync() is explicitly called; there is no background replication thread or event-driven propagation, giving tests full control over replication timing.

Source: entries/2026/05/29/multi-leader-replication-testertestmulti_leader.md

multi-leader-tombstone-delete [IN] OBSERVATION

Deletes are implemented as tombstone writes (TOMBSTONE sentinel with istombstone=True) that replicate like normal mutations; get() returns None for tombstoned keys

Source: entries/2026/05/29/multi-leader-replication-multi_leader.md

multiple-core-components-assume-single-thread-silently [IN] OBSERVATION

The B-tree, garbage collector, and consistent hash ring all assume single-threaded access without locks, assertions, or documentation — a systematic pattern where concurrency unsafety is implicit rather than enforced.

mvcc-active-set-frozen-at-begin [IN] OBSERVATION

MVCCDatabase.begintransaction captures activeat_start as a frozen set of all currently active transaction IDs at that moment; it is never modified after creation and serves as the transaction's immutable snapshot boundary

Source: entries/2026/05/29/topic-snapshot-vs-ssi-tradeoffs.md

mvcc-active-tx-not-recoverable [IN] OBSERVATION

Active (uncommitted) transactions cannot survive a process restart; the safe recovery strategy is to treat them as aborted, which is consistent with isvisible's existing handling of aborted transaction versions as invisible

Source: entries/2026/05/29/topic-snapshot-disk-persistence.md

mvcc-append-only-versions [IN] OBSERVATION

Writers never modify existing Version objects (except self-overwrites); new values are appended to per-key version lists, preserving the immutable snapshot for concurrent readers

Source: entries/2026/05/29/topic-mvcc-snapshot-isolation.md

mvcc-counters-must-be-monotonic [IN] OBSERVATION

nexttxid and nexttimestamp must never reissue a previously used value; if they reset to 1 on restart, visibility comparisons like createdby < tx.tx_id silently produce wrong results by conflating old and new transaction IDs

Source: entries/2026/05/29/topic-snapshot-disk-persistence.md

mvcc-no-persistence [IN] OBSERVATION

MVCCDatabase has zero serialization or persistence support; all state (versions, transactions, committed, aborted, counters) lives only in memory with no todict, fromdict, or file I/O of any kind

Source: entries/2026/05/29/topic-snapshot-disk-persistence.md

mvcc-pattern-exists-at-row-level [IN] OBSERVATION

The codebase implements MVCC visibility (ssidatabase.py:visiblevalue, mvccdatabase.py:Version) at the row/key level but not at the SSTable-file level where LevelDB uses Version reference counting for concurrent reader isolation

Source: entries/2026/05/29/topic-mvcc-version-snapshots.md

mvcc-three-layer-visibility-model [IN] OBSERVATION

MVCC visibility is enforced through three complementary invariants: append-only version storage prevents lost updates, own-writes are always visible regardless of commit state, and deletions follow the same visibility rules as writes — collectively ensuring snapshot consistency.

mvcc-txid-monotonicity-assumed [IN] OBSERVATION

The visibility check createdby < tx.txid assumes transaction IDs are monotonically increasing to mean "started before us"; non-monotonic or reused IDs would break snapshot correctness.

Source: entries/2026/05/29/snapshot-isolation-mvccdatabase-is_visible.md

mvcc-uncommitted-data-memory-only [IN] OBSERVATION

In MVCCDatabase, uncommitted transaction writes exist only as in-memory Version objects in versions and never reach a persistent data store; abort works by adding the txid to _aborted so visibility checks filter it out

Source: entries/2026/05/29/topic-undo-logging-and-steal-policy.md

mvcc-visibility-depends-on-tx-metadata [IN] OBSERVATION

isvisible requires committed, aborted, and each transaction's activeatstart set; persisting version chains without this metadata makes them unreadable after restore since visibility cannot be determined

Source: entries/2026/05/29/topic-snapshot-disk-persistence.md

naming-convention-not-uniform [IN] OBSERVATION

The tester/test split uses at least three naming patterns across directories: testertest*.py, test*tester.py, and testtester*.py

Source: entries/2026/05/29/topic-tester-test-pattern.md

neither-bitcask-supports-expiry [IN] OBSERVATION

Neither Bitcask implementation supports per-key TTL or expiry; hash-index-storage stores a timestamp in the header but never checks it for expiration, and log-structured-hash-table omits timestamps entirely.

Source: entries/2026/05/29/topic-bitcask-paper-vs-implementation.md

neither-partitioning-strategy-fully-reliable [IN] OBSERVATION

Neither partitioning strategy offers both reliable routing and ordered access: range partitioning preserves key ordering but depends on maintaining parallel arrays in lockstep with invariants that are never verified (boundaries/partitions correspondence, adjacency assumptions in merge), while hash partitioning provides deterministic single-lookup routing but permanently destroys lexicographic key order, making range queries impossible.

network-partitions-amplify-write-correctness-gaps [IN] OBSERVATION

Network partitions create a force multiplier on distributed write correctness gaps: partitions disrupt gossip-based failure detection (the single correctness gate for replication, read repair, and hinted handoff), while writes that do proceed operate under already-weakened quorum semantics (sloppy quorums count hints as successes, sub-quorum configurations accepted with only a warning).

next-free-page-is-monotonic [IN] OBSERVATION

The nextfreepage metadata field only ever increases — it is the high-water mark of file extension, never decremented even when pages are freed; file size never shrinks, only interior pages are recycled via the free list.

Source: entries/2026/05/29/topic-free-list-reuse-under-churn.md

no-atomic-file-creation [IN] OBSERVATION

No SSTable writer or compaction routine in any implementation uses the write-temp/fsync/rename pattern; all write directly to the final file path

Source: entries/2026/05/29/topic-write-to-temp-then-rename-pattern.md

no-ci-or-runner-for-tester-tests [IN] OBSERVATION

No CI pipeline (no .github/ directory), Makefile, or orchestration script exists to invoke testertest*.py files; they are run manually via python testertest*.py

Source: entries/2026/05/29/topic-tester-runner-system.md

no-compaction-manifest [IN] OBSERVATION

None of the three storage engines (hash-index, log-structured-hash-table, LSM) use a manifest or compaction log to make segment replacement atomic; segment discovery is purely filesystem-based via directory listing.

Source: entries/2026/05/28/topic-bitcask-compaction-atomicity.md

no-compaction-rate-limiting [IN] OBSERVATION

Neither LSM implementation throttles compaction I/O throughput; a large merge will saturate disk bandwidth without regard to concurrent foreground reads or writes.

Source: entries/2026/05/29/topic-bitcask-merge-window-scheduling.md

no-component-resyncs-after-corruption [IN] OBSERVATION

Every binary record reader in this codebase (WAL, both Bitcasks, LSM WAL, SSTable, B-tree WAL) stops at the first CRC failure, short read, or parse error; none attempt to scan forward for the next valid record boundary

Source: entries/2026/05/29/topic-length-prefix-framing-resilience.md

no-concurrency-primitives [IN] OBSERVATION

The entire B-tree implementation contains zero locking, latching, mutex, or thread-safety constructs; it is strictly single-threaded with no concurrent-access support

Source: entries/2026/05/28/topic-concurrent-btree-deletion.md

no-concurrency-protection-on-base-offsets [IN] OBSERVATION

Neither truncate nor compactpartition holds a lock; concurrent calls on the same partition would race on both partitions and baseoffsets, assuming single-threaded execution.

Source: entries/2026/05/29/topic-log-compaction-vs-retention.md

no-conftest-anywhere [IN] OBSERVATION

The ddia-implementations repo contains zero conftest.py files at any directory level — confirmed by exhaustive grep across all 37 modules, not just root

Source: entries/2026/05/29/topic-conftest-hierarchy.md

no-cross-file-fsync-ordering-protocol [IN] OBSERVATION

The standalone WAL module provides no mechanism or documentation for enforcing the required WAL-before-data fsync ordering; callers must implement the fsync(wal) → write(data) → fsync(data) protocol themselves.

Source: entries/2026/05/29/topic-fsync-ordering-guarantees.md

no-cuckoo-filter-in-codebase [IN] OBSERVATION

The repository contains no cuckoo filter implementation; the counting Bloom filter is the only deletion-supporting probabilistic filter, leaving the saturation problem unsolved

Source: entries/2026/05/29/topic-counter-overflow-and-cuckoo-filters.md

no-cuckoo-or-xor-filter-in-repo [IN] OBSERVATION

The repository has no cuckoo filter, quotient filter, or xor filter implementation; CountingBloomFilter is the only probabilistic membership structure supporting deletion.

Source: entries/2026/05/29/topic-cuckoo-filters.md

no-data-path-is-trustworthy [IN] OBSERVATION

No data path through the system is trustworthy: the primary storage path through compaction is unsalvageable (crash-unsafe within a broken durability pipeline, no concurrent access protection) while the secondary derived-data path through event infrastructure is unreliable in both content semantics and ordering guarantees, leaving no channel through which data can be read or propagated with confidence.

no-defense-in-depth-against-corruption [IN] OBSERVATION

The system has no defense-in-depth against data corruption: input validation is systematically absent at API boundaries (malformed data enters freely), and corruption is terminal across all readers with no resync capability (corrupted data can never be recovered) — the system is maximally permeable to corruption entry and maximally fragile to corruption presence.

no-direct-page-writes [IN] OBSERVATION

The BTree class never calls pm.writepage directly for data pages; all data page writes are routed through walwritepage (or walwrite_meta for page 0) to maintain the write-ahead invariant.

Source: entries/2026/05/29/b-tree-storage-engine-btree-walwrite_page.md

no-directory-fsync-anywhere [IN] OBSERVATION

No implementation in the codebase calls os.fsync() on a directory file descriptor; all fsync calls target data file descriptors only

Source: entries/2026/05/28/topic-directory-fsync-semantics.md

no-disk-artifact-patterns [IN] OBSERVATION

On-disk files created by storage-engine modules (B-tree pages, WAL segments, SSTable files) are not covered by .gitignore, relying on test teardown or manual cleanup

Source: entries/2026/05/29/unknown.md

no-double-free-detection [IN] OBSERVATION

Calling freepage twice on the same page creates a cycle in the free list, which allocatepage cannot detect, leading to silent data corruption through repeated page reuse.

Source: entries/2026/05/29/b-tree-storage-engine-btree-free_page.md

no-fixed-block-boundaries-in-sstable [IN] OBSERVATION

Both SSTable implementations write variable-length records contiguously with no fixed-size block structure, making mid-file resynchronization after corruption impossible — a corrupted length field causes the reader to consume all subsequent valid records as garbage.

Source: entries/2026/05/29/topic-block-based-wal-format.md

no-high-key-fence [IN] OBSERVATION

No node stores a high-key (upper-bound fence key); searchers cannot detect mid-split state, which would be required for Lehman & Yao's concurrent search protocol

Source: entries/2026/05/29/topic-lehman-yao-1981.md

no-high-key-per-node [IN] OBSERVATION

Nodes store no upper-bound "high key," so a reader cannot detect that a concurrent split moved keys to a right sibling; this is the second missing prerequisite for Lehman-Yao's single-latch protocol

Source: entries/2026/05/29/topic-sibling-pointer-maintenance.md

no-leader-lease-in-codebase [IN] OBSERVATION

Neither the Raft nor TOB implementations include leader lease or ReadIndex optimizations; all read and write operations pay full consensus cost.

Source: entries/2026/05/29/topic-linearizable-reads-via-tob.md

no-leak-detection-mechanism [IN] OBSERVATION

The B-tree has no allocation bitmap, page-reachability check, or post-recovery consistency scan to detect leaked or orphaned pages; once a page is lost between the tree and the free list, it is silently unrecoverable.

Source: entries/2026/05/29/topic-free-list-corruption-risks.md

no-o-direct-usage [IN] OBSERVATION

No file in the ddia-implementations codebase uses ODIRECT, os.open(), or any os.O flags; all I/O goes through Python's buffered open() and the kernel page cache

Source: entries/2026/05/29/topic-o-direct-and-dio.md

no-operational-path-to-correctness [IN] OBSERVATION

There is no operational path to system correctness: the system has no stable regime at any scale (storage degrades monotonically, failures widen the correctness gap, no paradigm is both scalable and self-healing), AND even if a stable state existed, correctness could not be verified (the gap between specification and implementation widens under every failure mode and testing validates the wrong model).

no-page-size-enforcement-in-serialize [IN] OBSERVATION

Neither serializeinternal nor serializeleaf checks whether the serialized output fits within pagesize; overflow prevention is the caller's responsibility via the maxkeys limit in BTree._insert.

Source: entries/2026/05/29/b-tree-storage-engine-btree-serializeinternal.md

no-pytest-config-exists [IN] OBSERVATION

The repository contains no pyproject.toml, pytest.ini, setup.cfg, or root conftest.py; pytest runs with pure default settings

Source: entries/2026/05/29/topic-pytest-configuration.md

no-refcount-in-codebase [IN] OBSERVATION

No file in the repository implements reference counting on SSTable files, Version objects, or any storage-layer snapshot structure

Source: entries/2026/05/29/topic-mvcc-version-lifetime.md

no-safe-operating-mode-exists [IN] OBSERVATION

The system has no safe operating mode in any scenario: it is unsafe under concurrent access (both read and write paths lack synchronization across core components) AND unsafe during crash recovery (no storage engine has a fully safe recovery path), meaning correctness is unachievable whether the system is running normally under load or recovering from failure.

no-schema-validation [IN] OBSERVATION

The batch pipeline performs no validation that adjacent stages produce/consume compatible record formats; mismatched tuple shapes fail at runtime with indexing errors

Source: entries/2026/05/29/batch-word-count-pipeline.md

no-shadow-paging-implementation-exists [IN] OBSERVATION

The codebase contains no copy-on-write or shadow paging implementation; all three storage engines use write-ahead logging exclusively for crash recovery

Source: entries/2026/05/29/topic-wal-vs-shadow-paging.md

no-sibling-sentinel-is-max-uint32 [IN] OBSERVATION

NO_SIBLING = 0xFFFFFFFF marks the end of the leaf sibling chain; a leaf with this value is the rightmost at its level

Source: entries/2026/05/29/topic-blink-tree-paper.md

no-storage-paradigm-is-both-scalable-and-self-healing [IN] OBSERVATION

No storage paradigm offers both scalability and structural resilience: hash indexes are memory-bound, LSM miss-probes are linear due to missing Bloom integration, and all paradigms degrade monotonically after crashes with no safe recovery path (crash-unsafe compaction, incomplete CRC, replay without batch atomicity).

no-structured-topology-support [IN] OBSERVATION

The codebase has no ring, star, or mesh topology modes for gossip; peer selection is always uniform random from all active nodes.

Source: entries/2026/05/29/topic-topology-propagation-bounds.md

no-subsystem-has-viable-recovery-strategy [IN] OBSERVATION

No subsystem at any architectural tier has a viable recovery strategy: storage-layer recovery is paradoxically over-engineered in infrastructure and under-implemented in usage (WAL builds complete but unused sequence/checkpoint machinery, with no safe crash recovery path), while application-layer event sourcing lacks durable persistence and uses conflated ID spaces — recovery fails both where infrastructure exists but is unused and where it was never built.

no-transactional-offset-commit [IN] OBSERVATION

The partitioned log provides no mechanism to atomically commit consumer offsets together with processing side effects, ruling out exactly-once delivery without external idempotency

Source: entries/2026/05/29/topic-kafka-consumer-offset-semantics.md

no-vector-clocks-anywhere [IN] OBSERVATION

The entire repository uses scalar Lamport clocks for ordering; no vector clock or version vector implementation exists, so true concurrency detection (distinguishing causal order from concurrent writes) is absent

Source: entries/2026/05/29/topic-conflict-resolution-strategies.md

none-mode-never-syncs-on-append [IN] OBSERVATION

In none sync mode, dosync() is a complete no-op for individual appends; data reaches stable storage only via WAL rotation or explicit close

Source: entries/2026/05/29/topic-fsync-semantics-by-mode.md

not-lehman-yao-blink-tree [IN] OBSERVATION

The B-tree is a classical single-threaded B⁺-tree, not a Lehman-Yao B-link tree — internal nodes lack right-link pointers, there are no high keys, and no latch coupling; leaf sibling pointers exist solely for range scan traversal.

Source: entries/2026/05/29/topic-internal-node-format.md

offset-tracking-per-partition [IN] OBSERVATION

Consumer tracks offsets as a dict[tuple[str, int], int] mapping (topic, partition) to offset, so commit granularity is per-partition, not per-message

Source: entries/2026/05/29/topic-kafka-consumer-offset-semantics.md

only-new-primary-advances-view-in-vc [IN] OBSERVATION

Non-primary nodes do not update currentview in handleviewchange; they wait for the NEW_VIEW message to adopt the new view, leaving them in a liminal state that rejects normal-phase messages for both old and new views.

Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-handleview_change.md

operation-result-tracks-partition-cost [IN] OBSERVATION

Every DocumentPartitionedDB and TermPartitionedDB operation returns an OperationResult with a partitions_touched field, making read/write amplification cost observable and directly comparable between strategies.

Source: entries/2026/05/29/secondary-index-partitioning-testsecondaryindex_partitioning.md

ordering-infrastructure-broken-at-every-layer [IN] OBSERVATION

Ordering and position infrastructure is broken or volatile at every layer: WAL sequence numbers are diligently computed and stored but never consulted during recovery, event sourcing conflates stream-scoped and global position IDs creating addressing ambiguity, and CDC consumer positions exist only in memory and are lost on restart — no ordering mechanism in the codebase reliably serves its intended purpose.

ordering-models-are-incompatible-across-modules [IN] OBSERVATION

The codebase uses three incompatible time/ordering models with no unifying framework: Lamport clocks provide total order via node-ID tiebreaking, vector clocks provide partial order with explicit concurrency detection, and wall-clock timestamps provide no causal guarantees and are not even monotonic.

orset-add-wins-semantics [IN] OBSERVATION

Concurrent add and remove of the same element in ORSet resolves in favor of the add, because the add's unique tag was not in the remover's tombstone snapshot

Source: entries/2026/05/29/conflict-free-replicated-data-types-crdts.md

orset-no-causal-context [IN] OBSERVATION

The ORSet implementation uses no version vector, dot context, or causal summary to compress tombstone state; each removed tag is stored individually as a (replica_id, seq) tuple in the tombstone set

Source: entries/2026/05/29/topic-or-set-tombstone-growth.md

orset-remove-returns-false-for-absent [IN] OBSERVATION

ORSet.remove() returns False for a non-existent element rather than raising an exception, treating removal of absent elements as a no-op with a boolean status indicator.

Source: entries/2026/05/29/conflict-free-replicated-data-types-test_crdts.md

orset-tombstone-growth-unbounded-and-necessary [IN] OBSERVATION

ORSet tombstones create an inherent resource leak: they must be retained indefinitely for merge correctness (removing them causes deleted elements to reappear when merging with replicas that still hold the original tags), yet they grow monotonically with no compaction or garbage collection, meaning memory consumption increases without bound over the set's lifetime proportional to total remove operations.

orset-tombstone-required-for-merge-correctness [IN] OBSERVATION

Dropping ORSet tombstones would cause removed elements to reappear when merging with a replica that still holds the element's active tags — the tombstone set in merge() at line 165 suppresses stale tags via set difference

Source: entries/2026/05/29/topic-or-set-tombstone-growth.md

orset-tombstone-set-monotonic [IN] OBSERVATION

ORSet._tombstones is append-only: tags are added during remove() and unioned during merge() but never deleted, so the tombstone set grows without bound over the lifetime of the replica

Source: entries/2026/05/29/topic-or-set-tombstone-growth.md

orset-tombstones-grow-monotonically [IN] OBSERVATION

ORSet._tombstones is only ever unioned during merge() and added to during remove(); there is no compaction or garbage collection, so the tombstone set grows without bound

Source: entries/2026/05/29/topic-delta-state-crdts.md

overlapping-determines-merge-set [IN] OBSERVATION

The merge set for leveled compaction is always exactly one source SSTable plus all SSTables in the target level whose key range intersects it, constructed via _overlapping() at sstable.py:428-429.

Source: entries/2026/05/29/topic-leveled-compaction-write-amplification.md

overlapping-is-pure-query [IN] OBSERVATION

overlapping performs no I/O, mutation, or side effects — it operates entirely on in-memory metadata (minkey, max_key) cached during SSTableReader construction.

Source: entries/2026/05/29/sstable-and-compaction-sstable-_overlapping.md

overlapping-uses-interval-overlap-test [IN] OBSERVATION

overlapping determines SSTable key-range overlap using the standard interval overlap test (a.minkey <= b.maxkey AND a.maxkey >= b.min_key) with lexicographic string comparison matching the SSTable sort order.

Source: entries/2026/05/29/sstable-and-compaction-sstable-_overlapping.md

own-writes-always-visible [IN] OBSERVATION

A transaction always sees versions it created itself (createdby == tx.txid), regardless of commit status, ensuring read-your-own-writes consistency within a transaction.

Source: entries/2026/05/29/snapshot-isolation-mvccdatabase-is_visible.md

padding-leaves-use-empty-hash [IN] OBSERVATION

Non-power-of-2 leaf counts are padded with EMPTY_HASH (SHA-256 of empty bytes) to form a complete binary tree; these padding nodes appear in proof paths but do not represent real data.

Source: entries/2026/05/29/topic-merkle-proof-security-model.md

page-manager-4096-coincidence [IN] OBSERVATION

PageManager defaults to 4096-byte pages matching the OS page size and common disk sector size, but this is a coincidental match with O_DIRECT alignment requirements — the implementation uses buffered I/O through Python's open() and never enforces alignment

Source: entries/2026/05/29/topic-o-direct-aligned-io.md

page-manager-flush-not-durable [IN] OBSERVATION

PageManager calls flush() on every writepage and write_meta call but only calls os.fsync() in sync() and close(), so individual page writes are not crash-durable without the WAL layer

Source: entries/2026/05/29/b-tree-storage-engine-btree-PageManager.md

page-manager-meta-reads-untracked [IN] OBSERVATION

readmeta() does not increment pagesread, so metadata access is invisible to the I/O statistics reported by BTree.stats

Source: entries/2026/05/29/b-tree-storage-engine-btree-PageManager.md

page-manager-metadata-durability-gap [IN] OBSERVATION

B-tree metadata has a durability gap at both lifecycle points: initial creation writes metadata directly without WAL protection, and all subsequent updates flush but never fsync, meaning metadata can be lost or inconsistent after any crash regardless of when it occurred.

page-manager-no-page-size-on-disk [IN] OBSERVATION

The page size is not persisted in the data file; reopening with a different page_size value silently corrupts all page reads with no error or detection mechanism

Source: entries/2026/05/29/b-tree-storage-engine-btree-PageManager.md

page-padding-to-page-size [IN] OBSERVATION

walwritepage pads data to exactly pagesize with null bytes before passing it to both wal.logwrite and pm.writepage, ensuring the WAL record and data file page are byte-identical.

Source: entries/2026/05/29/b-tree-storage-engine-btree-walwrite_page.md

page-size-default-is-4096 [IN] OBSERVATION

PageManager defaults to 4096-byte pages, matching typical filesystem block size; this sets the hard upper bound for serialized leaf and internal node content.

Source: entries/2026/05/29/topic-page-overflow-and-split-mechanics.md

page-zero-is-metadata [IN] OBSERVATION

Page 0 is always the metadata page and must never be written directly — only through writemeta; user data pages start at page 1, with the initial empty leaf root always at page 1

Source: entries/2026/05/29/b-tree-storage-engine-btree-PageManager.md

partial-hint-load-truncates-silently [IN] OBSERVATION

loadhint_file in both Bitcask implementations treats a short read as end-of-file (breaks out of the read loop without error), silently dropping all keys that would have appeared after the truncation point

Source: entries/2026/05/29/topic-bitcask-hint-durability.md

participant-cannot-self-resolve [IN] OBSERVATION

Participant.recover() returns a list of in-doubt transaction IDs but cannot commit or abort them; only the coordinator holds the decision, so participants remain locked until Coordinator.recover() re-sends it.

Source: entries/2026/05/29/topic-2pc-blocking-problem.md

partition-failures-compound-across-membership-and-consensus [IN] OBSERVATION

Network partitions create cascading failures across the membership and consensus layers: gossip-based failure detection is the single correctness gate for replication, and Raft's partition hazard (stale leader silently accepting doomed writes, forced re-election on rejoin) depends on membership accuracy that gossip's timeout-based suspicion may misjudge.

partitioned-hash-join-partitions-both-sides [IN] OBSERVATION

PartitionedHashJoin.join() calls partition_dataset() on both inputs before joining, meaning partitions are computed at join time rather than assumed from storage layout

Source: entries/2026/05/29/topic-ddia-ch10-reduce-side-joins.md

partitioned-log-compaction-preserves-keyless [IN] OBSERVATION

Log compaction retains only the last occurrence per key but always preserves messages with key=None.

Source: entries/2026/05/29/partitioned-log-partitioned_log.md

partitioned-log-key-routing-deterministic [IN] OBSERVATION

Messages with the same key always route to the same partition via MD5(key) % num_partitions, guaranteeing per-key ordering within a partition.

Source: entries/2026/05/29/partitioned-log-partitioned_log.md

partitioned-log-offset-monotonic [IN] OBSERVATION

Offsets within a partition are strictly increasing and never reused; baseoffsets only advances forward via truncate and compact_partition.

Source: entries/2026/05/29/partitioned-log-partitioned_log.md

partitioned-log-partition-count-limit [IN] OBSERVATION

Topic._init_ enforces partition count between 1 and 128, raising ValueError outside that range.

Source: entries/2026/05/29/partitioned-log-partitioned_log.md

partitioned-log-persistence-jsonl [IN] OBSERVATION

When persistdir is set, messages are appended as JSONL files named {topic}{partition}.jsonl and committed offsets are written to a single offsets.json.

Source: entries/2026/05/29/partitioned-log-partitioned_log.md

partitioned-log-rebalance-eager-round-robin [IN] OBSERVATION

ConsumerGroup.rebalance() is synchronous and triggered on every membership change, distributing partitions round-robin across sorted consumer IDs — no incremental or cooperative protocol.

Source: entries/2026/05/29/partitioned-log-partitioned_log.md

payload-only-crc-leaves-metadata-unprotected [IN] OBSERVATION

All three storage engine WAL implementations share a systematic integrity blind spot: CRC checksums cover only data payloads, leaving routing metadata (sequence numbers, page numbers) unprotected against silent corruption.

pbft-cluster-validates-n-3f-plus-1 [IN] OBSERVATION

PBFTCluster raises ValueError if n is not exactly 3f+1 or if more than f byzantine nodes are specified, enforcing the minimum redundancy required for Byzantine fault tolerance.

Source: entries/2026/05/29/byzantine-fault-tolerance-test_pbft.md

pbft-commit-self-logged-before-broadcast [IN] OBSERVATION

checkprepared logs the node's own COMMIT into messagelog and phasesenders before returning it for broadcast, so duplicate detection correctly rejects re-delivery of the node's own COMMIT.

Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-checkprepared.md

pbft-default-str-fragile [IN] OBSERVATION

The default=str parameter in the PBFT digest function silently converts non-serializable objects to their str() representation, which is not deterministic across replicas for types like datetime or custom objects — a latent consensus-breaking bug if non-primitive request payloads are introduced.

Source: entries/2026/05/29/topic-json-canonicalization-risks.md

pbft-duplicate-messages-silently-dropped [IN] OBSERVATION

PBFTNode.receive_message returns an empty list for duplicate or invalid messages (out-of-range sender IDs, already-seen tuples) rather than raising exceptions, consistent with Byzantine protocol design where you can't trust the sender.

Source: entries/2026/05/29/byzantine-fault-tolerance-test_pbft.md

pbft-executed-log-sequence-numbers-contiguous [IN] OBSERVATION

Honest nodes execute requests with contiguous sequence numbers starting from 1, enforced by the protocol's ordering guarantees and verified by test assertions on the executedlog.

Source: entries/2026/05/29/byzantine-fault-tolerance-test_pbft.md

pbft-prepared-at-most-once [IN] OBSERVATION

A (view, seq) pair can transition to "prepared" at most once per node, enforced by the preparedrequests set guard at the top of check_prepared.

Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-checkprepared.md

pbft-prepared-chains-to-committed [IN] OBSERVATION

checkprepared always calls checkcommitted after marking prepared, enabling immediate commit if enough COMMIT messages arrived out of order before the node reached the prepared state.

Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-checkprepared.md

pbft-prepared-threshold-is-2f [IN] OBSERVATION

checkprepared requires exactly 2f matching PREPAREs (not 2f+1) because the primary's PRE-PREPARE counts as implicit agreement toward the quorum of 2f+1.

Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-checkprepared.md

pbft-primary-skips-prepare [IN] OBSERVATION

The primary sends PRE-PREPARE but not PREPARE; handleprepare rejects PREPAREs from the primary, and the primary's code path in handlepre_prepare skips sending a PREPARE — this asymmetry is why the prepared threshold is 2f not 2f+1.

Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-checkprepared.md

pbft-sole-sort-keys-user [IN] OBSERVATION

byzantine-fault-tolerance/pbft.py:46 is the only call site in the codebase that uses json.dumps(sort_keys=True) for hashing; all other json.dumps calls are for storage serialization where canonical form is irrelevant.

Source: entries/2026/05/29/topic-json-canonicalization-risks.md

pbft-submit-request-is-synchronous-simulation [IN] OBSERVATION

PBFTCluster.submit_request runs the full three-phase protocol to completion deterministically in a single call (no real networking or async), returning a boolean success indicator.

Source: entries/2026/05/29/byzantine-fault-tolerance-test_pbft.md

pbft-view-change-preserves-prepared-requests [IN] OBSERVATION

Requests that reach the prepared state survive view changes: the new primary collects prepared-but-uncommitted requests from VIEWCHANGE messages and re-proposes them with fresh PREPREPARE messages in the new view.

Source: entries/2026/05/29/byzantine-fault-tolerance-test_pbft.md

pbft-view-change-preserves-safety-and-liveness [IN] OBSERVATION

PBFT view changes maintain both safety (requiring 2f+1 messages before acting) and liveness (carrying prepared-but-uncommitted requests into the new view), matching the theoretical protocol guarantees.

pending-queue-drives-ring-propagation [IN] OBSERVATION

Every accepted remote change (including synthesized merge results) is appended to _pending for downstream propagation, enabling multi-hop replication in ring topologies where a single sync round cannot reach all nodes.

Source: entries/2026/05/29/multi-leader-replication-multileader-applyremote_change.md

pending-requeue-enables-multihop [IN] OBSERVATION

Every accepted remote change in applyremotechange is appended to _pending, causing it to be forwarded to the next node in the replication topology during the next sync cycle

Source: entries/2026/05/29/topic-ring-topology-propagation.md

per-record-crc-cannot-detect-reordering [IN] OBSERVATION

Independent per-record CRCs in wal.py, btree.py, and bitcask.py validate each record in isolation, so record deletion, reordering, or splicing produces a file where every CRC passes but the log is semantically corrupt

Source: entries/2026/05/29/topic-cumulative-checksums.md

persist-after-memory [IN] OBSERVATION

The event is appended to self.events (in-memory) before persist_event writes it to disk; a disk write failure leaves the in-memory store ahead of the on-disk log with no rollback.

Source: entries/2026/05/29/event-sourcing-store-eventstore-persist_event.md

persist-before-notify [IN] OBSERVATION

Disk persistence via persistevent happens before subscriber notification: an event is written to the NDJSON file before any LiveProjection or other subscriber sees it.

Source: entries/2026/05/29/event-sourcing-store-eventstore-persist_event.md

persist-event-no-fsync [IN] OBSERVATION

persistevent writes to the OS buffer via file.write() but never calls fsync; event data can be lost on crash between write and OS flush to disk.

Source: entries/2026/05/29/event-sourcing-store-eventstore-persist_event.md

persist-event-per-call-open [IN] OBSERVATION

The persist file is opened in append mode and closed on every single persistevent call, not held open across the store's lifetime — safe but incurs per-event open/close overhead.

Source: entries/2026/05/29/event-sourcing-store-eventstore-persist_event.md

pipeline-lazy-pull-model [IN] OBSERVATION

Pipeline execution is pull-based via nested Python generators; records flow only when the terminal consumer iterates, and stages interleave execution without threads

Source: entries/2026/05/29/batch-word-count-pipeline.md

pipeline-no-inter-stage-schema-validation [IN] OBSERVATION

Pipeline stages pass untyped tuples with no validation that the output shape of one stage matches the input expectations of the next; a mismatched stage composition fails at runtime, not at construction time

Source: entries/2026/05/29/topic-ddia-chapter-10-batch-processing.md

pipeline-records-are-tuples [IN] OBSERVATION

Pipeline stages produce and consume lists of (key, value) tuples, where key is a string and value depends on the stage (string for Tokenize, int for Count).

Source: entries/2026/05/29/batch-word-count-test_pipeline.md

pipeline-run-returns-list [IN] OBSERVATION

Pipeline.run() materializes all output records into a list, while run_lazy() returns an iterator that yields records on demand without full materialization.

Source: entries/2026/05/29/batch-word-count-test_pipeline.md

pipeline-stages-hide-materialization-barriers [IN] OBSERVATION

Pipeline stages have fundamentally different resource profiles hidden behind a uniform tuple interface: Count is an unbounded memory barrier that materializes all input before emitting any output, Sort bounds memory via external merge-sort disk spillover, yet both produce and consume identical tuple lists with no way for the pipeline to distinguish streaming stages from blocking ones.

pncounter-composed-of-two-gcounters [IN] OBSERVATION

PNCounter state is structured as two GCounter sub-counters (p for increments, n for decrements) with value() = p - n, confirming the standard decomposition from CRDT literature.

Source: entries/2026/05/29/conflict-free-replicated-data-types-test_crdts.md

pncounter-composes-gcounters [IN] OBSERVATION

PNCounter delegates entirely to two GCounter instances (p for increments, n for decrements); it contains no independent merge or counting logic

Source: entries/2026/05/29/conflict-free-replicated-data-types-crdts.md

point-lookup-always-reaches-leaf [IN] OBSERVATION

Every get must descend to a leaf node regardless of tree height since internal nodes hold no values; cost is exactly *h* page reads for a tree of height *h*

Source: entries/2026/05/29/topic-b-plus-tree-leaf-chains.md

post-crash-compaction-produces-irrecoverable-corruption [IN] OBSERVATION

A crash during compaction produces irrecoverable data loss: no implementation uses write-temp/fsync/rename for atomicity (Bitcask deletes before renaming, LSM lacks atomic rename), and the resulting file corruption is terminal because every reader halts at the first CRC failure with no resync or skip capability.

postgres-defers-btree-rebalancing-to-vacuum [IN] OBSERVATION

PostgreSQL's nbtree never merges or redistributes siblings on delete; it marks tuples dead and relies on VACUUM to reclaim space, trading immediate space efficiency for throughput under concurrency

Source: entries/2026/05/28/topic-postgres-nbtree-lazy-deletion.md

predicate-exceptions-silently-skipped [IN] OBSERVATION

If the predicate function throws on a (key, value) pair in read_predicate, that pair is silently excluded from results via bare except Exception: pass — no logging, no error propagation

Source: entries/2026/05/29/write-skew-detection-ssidatabase-readpredicate.md

primary-key-reads-unaffected-by-async [IN] OBSERVATION

get(pk) reads directly from partitions[pid].documents and is never affected by asyncindex mode — staleness only applies to querybyfield which reads from globalindex

Source: entries/2026/05/29/topic-async-index-consistency.md

producer-keyed-hash-unkeyed-roundrobin [IN] OBSERVATION

Partitioned log Producer hashes keyed messages to a fixed partition for ordering guarantees, and round-robins unkeyed messages across partitions for load distribution

Source: entries/2026/05/29/topic-ddia-chapter-6-partitioning.md

projection-catch-up-advances-position-for-unhandled-events [IN] OBSERVATION

catchup advances position for every event, not just those with registered handlers, so unhandled event types don't create replay gaps

Source: entries/2026/05/29/event-sourcing-store-eventstore-catchup.md

projection-catch-up-assumes-sequential-ids [IN] OBSERVATION

catchup uses fromposition=self.position + 1 arithmetic that assumes eventid values are contiguous integers with no gaps; IDs with gaps would cause events to be silently skipped

Source: entries/2026/05/29/event-sourcing-store-eventstore-catchup.md

projection-catch-up-poison-event-stalls [IN] OBSERVATION

If a handler raises during catchup, position is not advanced past the failing event, causing all subsequent catch_up calls to re-encounter and re-fail on the same event indefinitely

Source: entries/2026/05/29/event-sourcing-store-eventstore-catchup.md

projection-is-stateless-replay [IN] OBSERVATION

A Projection's _state is a cache of derived data, not a source of truth — it can be fully reconstructed from the event log at any time by replaying from position 0

Source: entries/2026/05/29/topic-cqrs-read-models.md

projection-snapshot-interval-counts-all-events [IN] OBSERVATION

The snapshot interval counter increments for every event processed during catch_up, including those without handlers, so snapshot frequency is based on total event throughput not just handled events

Source: entries/2026/05/29/event-sourcing-store-eventstore-catchup.md

protocol-safety-unfalsifiable-under-current-testing [IN] OBSERVATION

Distributed protocol safety claims are unfalsifiable under the current testing methodology: specific protocol gaps (2PC's blocking window with unused timeout, Raft's stale-leader writes) AND general protocol behavior are validated exclusively under synchronous simulation with deterministic delivery — safety violations that manifest only under asynchronous or partitioned delivery are structurally invisible to the test suite.

protocol-safety-validated-only-under-synchronous-model [IN] OBSERVATION

Distributed protocol safety properties are validated exclusively under synchronous simulation (deterministic tick-based delivery, no real network I/O), but the most critical failure mode — network partitions creating stale-leader write acceptance and forced re-elections — is inherently asynchronous, creating an untested gap between modeled and real-world safety.

put-append-only [IN] OBSERVATION

put never modifies existing on-disk data; overwrites create a new record appended to the active segment and update only the in-memory index, leaving old records as reclaimable garbage until compaction

Source: entries/2026/05/29/log-structured-hash-table-bitcask-put.md

put-no-tombstone-guard [IN] OBSERVATION

put does not reject the TOMBSTONE sentinel as a value; passing it creates a record indistinguishable from a delete, which silently corrupts the key's state on recovery

Source: entries/2026/05/29/log-structured-hash-table-bitcask-put.md

put-return-value-ignored-by-coordinator [IN] OBSERVATION

ReadRepairStore never checks the boolean return from Replica.put() during read repair or writes, relying on version monotonicity to make stale or redundant writes harmless no-ops.

Source: entries/2026/05/29/read-repair-read_repair.md

pytest-files-have-broader-coverage [IN] OBSERVATION

The test*.py files are consistently longer than their testertest_*.py counterparts (e.g., 207 vs 125 lines for bloom-filter), containing additional edge case tests beyond the tester's core invariant checks

Source: entries/2026/05/29/topic-tester-vs-test-convention.md

pytest-must-run-per-directory [IN] OBSERVATION

Tests must be executed from within each module's directory (e.g., cd bloom-filter && pytest) because imports use bare module names with no package prefix or installed package

Source: entries/2026/05/29/topic-conftest-and-pytest-config.md

python-bitcask-no-keydir-sharing [IN] OBSERVATION

Each BitcaskStore instance independently rebuilds the full in-memory index from disk; there is no shared-memory mechanism to preserve or share the keydir across instances even within the same process, unlike Erlang's ETS-backed keydir

Source: entries/2026/05/29/topic-erlang-ets-concurrency-model.md

python-bitcask-no-refcount-or-locking [IN] OBSERVATION

Neither Python Bitcask implementation has reference counting, reader registration, or file-handle locking — compaction can delete segment files while a concurrent reader holds a stale index entry, unlike Erlang Bitcask which defers deletion until the last reader's refcount drops to zero.

Source: entries/2026/05/29/topic-reference-counted-file-handles.md

quorum-guarantees-weakened-at-two-levels [IN] OBSERVATION

The distributed storage layer systematically weakens quorum semantics: sloppy quorums count hint storage as successful writes, and sub-quorum configurations (R+W<=N) produce only warnings rather than errors, making it possible to operate without intersection guarantees.

quorum-violation-warns-not-raises [IN] OBSERVATION

When R + W <= N, the ReadRepairStore constructor emits a warnings.warn rather than raising an exception, allowing intentionally relaxed quorum configurations.

Source: entries/2026/05/29/read-repair-read_repair.md

raft-advance-commit-on-tick-and-response [IN] OBSERVATION

advancecommitindex is invoked both on heartbeat ticks and immediately after successful append responses update match_index, allowing eager commit advancement without waiting for the next heartbeat cycle.

Source: entries/2026/05/29/raft-consensus-raft-advancecommit_index.md

raft-cluster-deterministic-simulation [IN] OBSERVATION

RaftCluster uses tick-based deterministic simulation with no real clocks, threads, or network I/O; time advances only via explicit tick(ms) calls, making partition and election scenarios fully reproducible.

Source: entries/2026/05/29/raft-consensus-test_raft.md

raft-commit-index-monotonic [IN] OBSERVATION

commitindex only ever increases within advancecommitindex because the loop starts at commit_index + 1 and only assigns forward; it can never decrease or be reset.

Source: entries/2026/05/29/raft-consensus-raft-advancecommit_index.md

raft-commit-requires-strict-majority [IN] OBSERVATION

An entry is committed when count > total // 2, meaning for a 5-node cluster at least 3 replicas (including the leader) must have the entry, matching Raft's quorum requirement.

Source: entries/2026/05/29/raft-consensus-raft-advancecommit_index.md

raft-current-term-commit-only [IN] OBSERVATION

advancecommitindex only commits entries whose term matches current_term, implementing the Raft safety property from §5.4.2 that prevents committing prior-term entries by replica count alone.

Source: entries/2026/05/29/raft-consensus-raft.md

raft-follower-rejects-writes [IN] OBSERVATION

RaftNode.client_request() on a follower returns {"success": False, "error": "not leader"} as a structured dict rather than raising an exception; only the leader accepts client writes.

Source: entries/2026/05/29/raft-consensus-test_raft.md

raft-is-multi-paxos-optimization [IN] OBSERVATION

The Raft implementation in raft-consensus/raft.py embodies the Multi-Paxos leader-lease pattern: one election per term via becomeleader, then Phase-2-only replication (AppendEntries) for all entries within that term — heartbeats act as the lease

Source: entries/2026/05/29/topic-multi-paxos-vs-single-decree.md

raft-log-direct-indexed [IN] OBSERVATION

Log entries are accessed by log index as a direct Python list index (self.log[prevlog_index]), which assumes all entries from index 0 onward are present — log truncation would require an offset or a different data structure

Source: entries/2026/05/29/topic-raft-log-compaction.md

raft-log-never-shrinks [IN] OBSERVATION

self._log in RaftNode is append-only: no method ever removes entries, so the log grows monotonically with client requests and memory usage is unbounded

Source: entries/2026/05/29/topic-raft-log-compaction.md

raft-match-index-defaults-zero [IN] OBSERVATION

Peers missing from matchindex are treated as having replicated nothing via dict.get(peer, 0); becomeleader initializes all peers to 0, so this default is safe but allows graceful handling of unexpected state.

Source: entries/2026/05/29/raft-consensus-raft-advancecommit_index.md

raft-minority-cannot-elect [IN] OBSERVATION

A partition containing fewer than a strict majority of nodes cannot elect a leader; get_leader() returns None for a 2-of-5 minority partition.

Source: entries/2026/05/29/raft-consensus-test_raft.md

raft-no-leader-forwarding [IN] OBSERVATION

client_request returns {"success": False, "error": "not leader"} if the node is not the leader; there is no automatic forwarding to the current leader.

Source: entries/2026/05/29/raft-consensus-raft.md

raft-no-prevote-or-lease [IN] OBSERVATION

The Raft implementation has no PreVote, leader lease, or CheckQuorum mechanism; split-brain safety relies entirely on term numbers and the single-vote-per-term invariant

Source: entries/2026/05/29/topic-raft-leader-stickiness.md

raft-no-state-machine-apply [IN] OBSERVATION

lastapplied is tracked but never used to actually apply commands to a state machine, making snapshotting impossible without first implementing application logic

Source: entries/2026/05/29/topic-raft-log-compaction.md

raft-partition-creates-dual-hazard [IN] OBSERVATION

Network partitions create a compound safety hazard in Raft: the isolated leader silently accepts writes that can never commit, and when it rejoins its inflated term forces a disruptive re-election of the healthy partition's leader.

raft-partition-via-set [IN] OBSERVATION

Network partitions are simulated by adding node IDs to _partitioned; partitioned nodes neither tick nor send/receive messages, modeling complete network isolation.

Source: entries/2026/05/29/raft-consensus-raft.md

raft-randomized-timeout-only-split-vote-defense [IN] OBSERVATION

The only mechanism preventing simultaneous candidacy is the randomized election timeout range (default 150–300ms); there is no protocol-level tiebreaker such as PreVote

Source: entries/2026/05/29/topic-raft-leader-stickiness.md

raft-rejoin-forces-election [IN] OBSERVATION

A partitioned node that rejoins after repeated failed elections carries an inflated term number, forcing the healthy leader to step down via becomefollower and triggering a completely unnecessary cluster-wide election

Source: entries/2026/05/29/topic-raft-leader-stickiness.md

raft-sentinel-entry-assumption [IN] OBSERVATION

The Raft log is initialized with a sentinel LogEntry(term=0, index=0, command=None) at position 0, and multiple methods depend on this entry always being present at self._log[0]

Source: entries/2026/05/29/topic-raft-log-compaction.md

raft-sentinel-log-entry [IN] OBSERVATION

The log is initialized with a sentinel LogEntry(term=0, index=0) at position 0 that is never removed, eliminating empty-log edge cases in lastlogindex() and lastlogterm().

Source: entries/2026/05/29/raft-consensus-raft.md

raft-single-tick-delivery [IN] OBSERVATION

RaftCluster.tick() collects outbound messages and delivers both the request and its response within the same tick invocation — there is no simulated network latency.

Source: entries/2026/05/29/raft-consensus-raft.md

raft-stale-leader-accepts-writes [IN] OBSERVATION

A partitioned leader continues accepting client requests (returning success: True) even though those entries can never commit and will be overwritten when the partition heals — the gap a leader lease would close

Source: entries/2026/05/29/topic-raft-leader-stickiness.md

raft-terms-strictly-increase [IN] OBSERVATION

Each new leader election produces a strictly higher term than the previous leader; terms never repeat or decrease across leader transitions, as validated by the testtermsmonotonically_increasing test.

Source: entries/2026/05/29/raft-consensus-test_raft.md

raft-uncommitted-entries-overwritten [IN] OBSERVATION

Uncommitted log entries from a deposed leader are overwritten when the node receives AppendEntries from a new leader with conflicting entries at the same index; only committed entries are durable.

Source: entries/2026/05/29/raft-consensus-test_raft.md

range-partition-merge-guards-threshold [IN] OBSERVATION

mergesmallpartitions() only merges adjacent partitions when their combined size stays below minpartitionsize, preventing immediate re-splitting after merge.

Source: entries/2026/05/29/range-partitioning-testrangepartitioning.md

range-partition-merge-is-greedy-left-to-right [IN] OBSERVATION

mergesmallpartitions sweeps left-to-right and greedily merges adjacent pairs whose combined size is at or below minpartitionsize, which can produce different results than an optimal merge strategy

Source: entries/2026/05/29/topic-hash-partitioning-skew.md

range-partition-no-exceptions-api [IN] OBSERVATION

RangePartitionedStore uses return-value signaling: get() returns None for missing keys and delete() returns False for absent keys, rather than raising exceptions.

Source: entries/2026/05/29/range-partitioning-testrangepartitioning.md

range-partition-sequential-insert-skew [IN] OBSERVATION

Range partitioning has its own skew risk: keys arriving in sorted order all hit the rightmost partition; Partition.split() mitigates this for data already stored by splitting at the median, but cannot prevent temporary hotspots during sequential ingestion

Source: entries/2026/05/29/topic-hash-partitioning-skew.md

range-partition-split-at-median [IN] OBSERVATION

Partition.split always divides at the median key (index len // 2), which can produce uneven partitions when key distribution is skewed

Source: entries/2026/05/29/topic-ddia-chapter-6-partitioning.md

range-partitioning-auto-split-manual-merge [IN] OBSERVATION

Splits are triggered automatically during put() when a partition exceeds maxpartitionsize, but merges require an explicit call to mergesmallpartitions() and never happen implicitly.

Source: entries/2026/05/29/range-partitioning-range_partitioning.md

range-partitioning-boundary-routing-bisect [IN] OBSERVATION

Key routing uses bisect_right(boundaries, key) - 1 on a parallel sorted list of partition start keys, giving O(log p) partition lookup where p is the partition count.

Source: entries/2026/05/29/range-partitioning-range_partitioning.md

range-partitioning-contiguous-half-open-coverage [IN] OBSERVATION

Partitions use half-open intervals [startkey, endkey) with the first partition starting at "" and the last having end_key=None, guaranteeing the entire string keyspace is covered with no gaps or overlaps.

Source: entries/2026/05/29/range-partitioning-range_partitioning.md

range-partitioning-merge-assumes-adjacency [IN] OBSERVATION

Partition.merge(other) appends the other partition's keys without re-sorting; correctness depends on the caller passing the immediate right neighbor, with no runtime adjacency validation.

Source: entries/2026/05/29/range-partitioning-range_partitioning.md

range-partitioning-no-thread-safety [IN] OBSERVATION

No concurrency protection exists on any mutation path (put, delete, split, merge); concurrent access would corrupt the sorted key/value lists.

Source: entries/2026/05/29/range-partitioning-range_partitioning.md

range-partitioning-parallel-arrays-routing [IN] OBSERVATION

partitions and boundaries are kept in lockstep as parallel arrays — index i in both refers to the same partition — trading O(n) insert/delete for O(1) indexed access and O(log n) binary search routing.

Source: entries/2026/05/29/range-partitioning-range_partitioning.md

range-partitioning-routing-is-structurally-fragile [IN] OBSERVATION

Range partitioning routing depends on maintaining strict lockstep invariants across parallel data structures that are never verified: boundaries and partitions are kept as parallel arrays where any desynchronization silently misroutes keys, merge operations assume partition adjacency without checking, and splits always divide at the median regardless of key distribution skew.

range-partitioning-split-at-median-index [IN] OBSERVATION

Partition.split() divides at len(keys)//2 (count-based median), not the lexicographic midpoint of the key range, producing equal-count halves but potentially unequal key ranges.

Source: entries/2026/05/29/range-partitioning-range_partitioning.md

range-scan-end-key-exclusive [IN] OBSERVATION

rangescan(startkey, endkey) treats endkey as an exclusive upper bound — range_scan("k05", "k10") returns keys k05 through k09

Source: entries/2026/05/29/topic-range-scan-implementation.md

range-scan-follows-sibling-chain [IN] OBSERVATION

rangescan walks the leaf-level linked list via nextsibling pointers rather than re-descending the tree, giving O(height + leaf pages in range) page reads.

Source: entries/2026/05/28/b-tree-storage-engine-btree-range_scan.md

range-scan-half-open-interval [IN] OBSERVATION

BTree.rangescan(startkey, endkey) returns keys in the half-open interval [startkey, endkey) — start-inclusive, end-exclusive; endkey=None scans to the last key.

Source: entries/2026/05/28/b-tree-storage-engine-btree-range_scan.md

range-scan-lacks-safety-guarantees [IN] OBSERVATION

Range scans are vulnerable to three independent failure modes: concurrent modifications produce inconsistent results (no snapshot isolation), corrupted sibling pointers can cause infinite traversal (no cycle guard), and large result sets consume unbounded memory (eager materialization into a list).

range-scan-last-writer-wins [IN] OBSERVATION

range_scan iterates sources oldest-to-newest (SSTables, then immutable memtables, then active memtable) so that dict overwrites naturally give each key the value from the newest source

Source: entries/2026/05/28/log-structured-merge-tree-lsm-range_scan.md

range-scan-materializes-results [IN] OBSERVATION

range_scan collects all matching (key, value) pairs into a list in memory rather than yielding lazily; large ranges consume proportional memory.

Source: entries/2026/05/28/b-tree-storage-engine-btree-range_scan.md

range-scan-merges-memtable-and-sstables [IN] OBSERVATION

rangescan() combines results from both the active memtable and all SSTables in self.sstables, making it sensitive to mutations of either data structure during iteration

Source: entries/2026/05/28/topic-lsm-concurrency-safety.md

range-scan-newer-wins-by-priority [IN] OBSERVATION

The range scan merge uses an incrementing priority counter where higher priority = newer source; the dict naturally deduplicates by overwriting older entries for the same key

Source: entries/2026/05/28/topic-merge-iterator-vs-dict-merge.md

range-scan-no-concurrency-protection [IN] OBSERVATION

range_scan can observe inconsistent state if a flush or compaction runs concurrently, since there is no snapshot isolation or locking

Source: entries/2026/05/28/log-structured-merge-tree-lsm-range_scan.md

range-scan-no-cycle-guard [IN] OBSERVATION

rangescan has no visited-set or cycle detection on the leaf sibling chain; a corrupted nextsibling pointer would cause an infinite loop.

Source: entries/2026/05/28/b-tree-storage-engine-btree-range_scan.md

range-scan-no-mutation [IN] OBSERVATION

Partition.rangescan is a pure read operation with no side effects — it never modifies keys, _values, or any other state, and returns a fresh list the caller can mutate freely.

Source: entries/2026/05/29/range-partitioning-rangepartitioning-rangescan.md

range-scan-none-end-scans-unbounded [IN] OBSERVATION

Passing end=None to Partition.rangescan sets the right boundary to len(self.keys), scanning from start through the last key with no upper bound.

Source: entries/2026/05/29/range-partitioning-rangepartitioning-rangescan.md

range-scan-priority-field-unused [IN] OBSERVATION

The priority integer stored in the merged dict during range_scan is never read or compared; correctness relies entirely on the source iteration order and dict overwrite semantics, making the priority field dead weight

Source: entries/2026/05/28/log-structured-merge-tree-lsm-range_scan.md

range-scan-pure-read [IN] OBSERVATION

range_scan has no side effects: it does not mutate tree state, trigger flushes, or modify the WAL

Source: entries/2026/05/28/log-structured-merge-tree-lsm-range_scan.md

range-scan-relies-on-sorted-keys [IN] OBSERVATION

rangescan correctness depends on keys being sorted, an invariant maintained by put/delete/split/merge but not enforced structurally — direct mutation of _keys bypassing these methods silently produces wrong results.

Source: entries/2026/05/29/range-partitioning-rangepartitioning-rangescan.md

range-scan-supports-unbounded [IN] OBSERVATION

Calling rangescan(startkey) without an endkey scans from startkey through the last key in the rightmost leaf, terminating at the NO_SIBLING sentinel

Source: entries/2026/05/29/topic-range-scan-implementation.md

reaches-uses-identity-not-equality [IN] OBSERVATION

_reaches uses is (object identity) to find the source event, not ==, because dataclass equality would conflate distinct events with identical field values.

Source: entries/2026/05/29/lamport-clocks-lamport-_reaches.md

reaches-walks-two-edge-types [IN] OBSERVATION

The causal graph has two edge types: parent (same-node sequential) and cause (cross-node send→receive), and _reaches must follow both to correctly determine happens-before.

Source: entries/2026/05/29/lamport-clocks-lamport-_reaches.md

read-path-anticipates-concurrent-flush [IN] OBSERVATION

Both get() (line 254) and rangescan() (line 286) check immutable_memtables in the correct order, indicating the read path was designed for a concurrent-flush architecture that the write path never implements

Source: entries/2026/05/29/topic-immutable-memtable-gap.md

read-path-unreliable-from-storage-through-derived-systems [IN] OBSERVATION

The complete read path from storage through derived systems is unreliable at every stage: the SSTable layer has compounding integrity and performance deficiencies (no per-entry checksums, linear miss-probes from missing Bloom integration, redundant key-list rebuilds per lookup) while the CDC backbone feeding all derived systems depends on reconstruction heuristics for event semantics and requires callers to manually flush and provide old values for consistency.

read-predicate-deps-only-for-committed [IN] OBSERVATION

read_predicate records dependency graph edges only for keys in the committed snapshot (k in snap), not for keys the transaction itself created via write() — avoiding self-dependencies

Source: entries/2026/05/29/write-skew-detection-ssidatabase-readpredicate.md

read-predicate-stores-result-snapshot [IN] OBSERVATION

readpredicate appends a (predicate, dict(result)) tuple to tx.predicate_locks — a deep copy so commit() can compare against the original result even after database state changes

Source: entries/2026/05/29/write-skew-detection-ssidatabase-readpredicate.md

read-repair-resurrects-deleted-keys [IN] OBSERVATION

DynamoCluster.get() read repair pushes the highest-versioned value to stale replicas, so a naive delete (removing from _store) is undone when a recovering node still holds the old value

Source: entries/2026/05/29/topic-leaderless-deletion-gap.md

read-repair-scope-is-quorum-only [IN] OBSERVATION

Read repair in get() only fixes replicas that were part of the read quorum, not all replicas; full-cluster repair requires a separate call to antientropyrepair().

Source: entries/2026/05/29/read-repair-read_repair.md

read-uses-linear-scan-for-offset-gaps [IN] OBSERVATION

Topic.read finds the start position by scanning for msg.offset >= offset rather than computing an array index, which is necessary because compaction creates offset holes that break arithmetic indexing.

Source: entries/2026/05/29/topic-log-compaction-vs-retention.md

readlines-accepts-list-or-filepath [IN] OBSERVATION

ReadLines accepts either a list[str] of in-memory lines or a file path string as its input source, providing two input modes from the same stage.

Source: entries/2026/05/29/batch-word-count-test_pipeline.md

reads-and-writes-same-consensus-path [IN] OBSERVATION

LinearizableRegister routes both reads and writes through the same TOB slot mechanism, giving them positions in the same total order — no read-only fast path exists.

Source: entries/2026/05/29/topic-linearizable-reads-via-tob.md

receive-gossip-death-is-sticky [IN] OBSERVATION

Once a node's local status is dead, receive_gossip will never change it back to alive, even if the remote has a higher heartbeat counter — preventing flapping after failure declaration

Source: entries/2026/05/29/gossip-protocol-gossipprotocol-receivegossip.md

receive-gossip-deep-copies-on-insert [IN] OBSERVATION

receivegossip deep-copies remote entries before inserting them into local membership, extending the isolation guarantee beyond the sendgossip/join/getmembershiplist paths covered by gossip-deep-copy-isolation

Source: entries/2026/05/29/gossip-protocol-gossipprotocol-receivegossip.md

receive-gossip-no-self-exclusion [IN] OBSERVATION

Unlike detectfailures which skips the node's own ID, receivegossip has no self-exclusion guard — if the remote gossip contains an entry for the receiver, it is merged normally

Source: entries/2026/05/29/gossip-protocol-gossipprotocol-receivegossip.md

receive-replica-is-passive-endpoint [IN] OBSERVATION

receivereplica in vector-clocks/vector_clock.py is the receiving side of anti-entropy replication; nothing in the codebase calls it automatically — tests simulate the call manually, and no gossip or sync module invokes it

Source: entries/2026/05/29/topic-dynamo-anti-entropy.md

reconcile-replaces-all-siblings [IN] OBSERVATION

reconcile unconditionally replaces all siblings for a key with a single merged version, regardless of whether the provided contexts cover every existing sibling.

Source: entries/2026/05/29/vector-clocks-vector_clock.md

reconstruct-state-is-pure-read [IN] OBSERVATION

reconstruct_state never modifies the EventStore; it is a read-only fold over a single stream's events that returns a locally-constructed dict

Source: entries/2026/05/29/event-sourcing-store-eventstore-reconstructstate.md

reconstruct-state-no-snapshot-support [IN] OBSERVATION

reconstructstate always replays from the beginning of the stream via readstream with default from_version=0; it does not leverage snapshots, unlike Projection

Source: entries/2026/05/29/event-sourcing-store-eventstore-reconstructstate.md

reconstruct-state-up-to-is-global-id [IN] OBSERVATION

The upto parameter in reconstructstate filters on global event_id, not stream-local version number, despite operating on a single stream

Source: entries/2026/05/29/event-sourcing-store-eventstore-reconstructstate.md

record-format-is-length-prefixed [IN] OBSERVATION

Each WAL record starts with a 4-byte little-endian length prefix (not counting itself), followed by CRC, seqnum, optype, and length-prefixed key/value pairs — minimum 25 bytes per record for empty key/value

Source: entries/2026/05/28/topic-wal-checksum-format.md

record-length-implicitly-validated [IN] OBSERVATION

Corruption of recordlength causes read_record to consume wrong bytes for the payload, which fails the CRC check indirectly — making explicit CRC coverage of the framing field unnecessary

Source: entries/2026/05/28/topic-wal-crc-coverage-gap.md

recover-seq-no-gap-detection [IN] OBSERVATION

recoverseq_num finds the max sequence number but does not detect or report gaps in the sequence space, meaning lost records between rotations or corruption are silently ignored.

Source: entries/2026/05/29/write-ahead-log-wal-recoverseq_num.md

recover-seq-num-is-global-max [IN] OBSERVATION

recoverseq_num returns the maximum sequence number across all WAL files, not just the latest file, ensuring correctness even when records are spread across rotated segments.

Source: entries/2026/05/29/write-ahead-log-wal-recoverseq_num.md

recover-seq-skips-corrupt-files [IN] OBSERVATION

recoverseqnum treats CRC errors as file-local (abandons the corrupted file, continues scanning subsequent files), unlike readallrecords which stops all scanning on corruption.

Source: entries/2026/05/29/write-ahead-log-wal-recoverseq_num.md

recovery-destroys-both-durability-and-isolation [IN] OBSERVATION

A single restart simultaneously destroys both durability and isolation guarantees along independent axes: the WAL's carefully calibrated write-time durability tiers (per-write fsync vs batch-only fsync) become invisible to recovery, and the transaction isolation model's composed invariant layers (MVCC visibility plus SSI conflict detection) break because abort metadata and monotonic counters are non-persistent.

recovery-requires-participant-availability [IN] OBSERVATION

Coordinator.recover() checks is_available() before re-sending decisions and skips unavailable participants, meaning a double failure (coordinator crash + participant crash) leaves locks held until both are up and recovery re-runs.

Source: entries/2026/05/29/topic-2pc-blocking-problem.md

recovery-scans-all-segments [IN] OBSERVATION

All three storage implementations (WAL, hash-index Bitcask, log-structured Bitcask) scan every segment file on startup to rebuild state, making recovery time proportional to segment count rather than data volume.

Source: entries/2026/05/29/topic-wal-segment-sizing-tradeoffs.md

recovery-simultaneously-over-and-under-engineered [IN] OBSERVATION

Recovery is paradoxically both over-engineered and under-implemented: the WAL builds complete sequence number and checkpoint infrastructure that's never consulted during replay, while the actual recovery paths lack batch atomicity and metadata CRC protection — engineering effort was invested in infrastructure that goes unused while the active recovery path remains unsafe.

reference-implementations-lack-tombstone-ttl [IN] OBSERVATION

None of the DDIA reference implementations combine tombstone support with time-bounded garbage collection; the multi-leader module stores tombstones indefinitely and the Dynamo module has no delete support at all.

Source: entries/2026/05/29/topic-tombstone-gc-and-repair-window.md

rename-without-dir-barrier [IN] OBSERVATION

Both Bitcask compaction paths (hash-index-storage/bitcask.py:297, log-structured-merge-tree/bitcask.py:301) perform os.rename() without a subsequent directory fsync, making the rename non-durable on ext4 and XFS

Source: entries/2026/05/28/topic-directory-fsync-semantics.md

renew-preserves-token-number [IN] OBSERVATION

LockService.renew() extends the lock TTL by mutating issued_at and ttl without changing the FencingToken.token value, while re-acquire() by the same client issues a new higher token

Source: entries/2026/05/29/fencing-tokens-fencing_tokens.md

replay-does-not-enforce-batch-atomicity [IN] OBSERVATION

Despite the docstring claiming replay "skips uncommitted batches," the implementation returns all PUT/DELETE records regardless of whether a matching COMMIT record exists — a crash mid-batch will replay partial batch records

Source: entries/2026/05/28/write-ahead-log-wal-replay.md

replay-flush-before-read [IN] OBSERVATION

replay flushes the write file descriptor under self._lock before reading, ensuring all prior append calls are visible on disk during the read phase

Source: entries/2026/05/28/write-ahead-log-wal-replay.md

replay-lacks-batch-atomicity-across-implementations [IN] OBSERVATION

Both the unbundled WAL and B-tree WAL replay all CRC-valid records without verifying batch completeness, meaning partial batches from mid-write crashes are silently applied as if they were complete transactions.

replay-not-lock-protected-during-read [IN] OBSERVATION

The file read phase of replay runs without holding self._lock, so concurrent append calls during replay could produce a partial read of in-flight writes

Source: entries/2026/05/28/write-ahead-log-wal-replay.md

replay-returns-eager-list [IN] OBSERVATION

replay returns an in-memory List[WALRecord] snapshot, not a lazy iterator — the full result set is materialized before returning, which can be large if the WAL is large

Source: entries/2026/05/28/write-ahead-log-wal-replay.md

replay-strict-greater-than-filter [IN] OBSERVATION

replay(afterseq=n) returns records with seqnum strictly greater than n (exclusive lower bound), not greater-or-equal — passing 0 replays the entire log

Source: entries/2026/05/28/write-ahead-log-wal-replay.md

replica-get-does-not-check-availability [IN] OBSERVATION

Replica.get does not consult self.available; filtering unavailable replicas is the caller's responsibility via ReadRepairStore.available_replicas().

Source: entries/2026/05/29/read-repair-read_repair-get.md

replica-get-has-no-side-effects [IN] OBSERVATION

Replica.get performs no mutations, availability checks, or I/O — it is a pure dictionary lookup, unlike ReadRepairStore.get which triggers read repair on stale replicas.

Source: entries/2026/05/29/read-repair-read_repair-get.md

replica-get-returns-tuple-or-none [IN] OBSERVATION

Replica.get returns a (value, version) tuple when the key exists, or None when it doesn't — callers must guard against None before destructuring, as there is no sentinel tuple.

Source: entries/2026/05/29/read-repair-read_repair-get.md

replica-put-no-concurrency-safety [IN] OBSERVATION

The version check and store update in Replica.put are not atomic (TOCTOU race on lines 22–25); the method is unsafe under concurrent access without external locking.

Source: entries/2026/05/29/read-repair-read_repair-put.md

replica-put-rejects-older-versions [IN] OBSERVATION

Replica.put() silently rejects writes with a version strictly less than the stored version (returns False), enforcing monotonic version advancement; equal versions are accepted as last-writer-wins.

Source: entries/2026/05/29/read-repair-read_repair.md

replica-versions-caller-coordinated [IN] OBSERVATION

Replica.put does not generate version numbers; version assignment is the responsibility of ReadRepairStore.put, which reads the max version across all replicas (including unavailable ones) before writing.

Source: entries/2026/05/29/read-repair-read_repair-put.md

replica-versions-start-at-one [IN] OBSERVATION

Versions assigned by ReadRepairStore.put start at 1 (computed as max_version + 1 with initial max of 0), so a version of 0 in ReadRepairStore.get return values means "key not found," not "initial version."

Source: entries/2026/05/29/read-repair-read_repair-get.md

reprepare-merges-by-sequence [IN] OBSERVATION

When the new primary merges prepared sets from VIEW_CHANGE messages, it deduplicates by sequence number using first-writer-wins without verifying that all nodes prepared the same request at that sequence — relying on the PBFT safety proof that at most one request can be prepared per (view, seq).

Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-handleview_change.md

resolve-record-dead-code [IN] OBSERVATION

writerfieldmap is constructed but never referenced in resolverecord, making it dead code likely left from a refactor

Source: entries/2026/05/29/avro-serializer-avroserializer-resolve_record.md

resolve-record-shared-defaults [IN] OBSERVATION

Default values in resolverecord are inserted by reference without deep copying, so mutable defaults (lists, dicts) would be shared across all decoded records that use the default

Source: entries/2026/05/29/avro-serializer-avroserializer-resolve_record.md

resolve-record-skip-unknown [IN] OBSERVATION

Writer fields not present in the reader schema are fully consumed from the buffer via _skip() and discarded, ensuring the byte stream stays correctly positioned for subsequent fields

Source: entries/2026/05/29/avro-serializer-avroserializer-resolve_record.md

resolve-record-two-pass [IN] OBSERVATION

resolverecord uses a two-pass algorithm: pass 1 consumes all writer fields from the buffer in wire order (decoding or skipping each), pass 2 assembles the result dict in reader field order from decoded values and defaults

Source: entries/2026/05/29/avro-serializer-avroserializer-resolve_record.md

resource-token-tracking-is-independent [IN] OBSERVATION

Each resource in FencedResourceServer maintains its own highesttoken entry, so a high token on resource A does not cause rejection of a lower token on resource B.

Source: entries/2026/05/29/topic-redlock-controversy.md

ring-requeues-via-apply-remote-change [IN] OBSERVATION

applyremotechange appends accepted changes to the receiving node's _pending list, enabling store-and-forward propagation in ring topology; without this re-enqueue, changes would stop at the first hop.

Source: entries/2026/05/29/topic-ring-vs-all-to-all-propagation-delay.md

ring-requires-n-minus-one-rounds [IN] OBSERVATION

RING topology advances changes by exactly one hop per sync() round due to snapshot-isolated pending queue draining, requiring N-1 rounds for a single-source change to reach all N nodes.

Source: entries/2026/05/29/topic-ring-vs-all-to-all-propagation-delay.md

ring-topology-convergence-is-linear-in-node-count [IN] OBSERVATION

Ring topology requires O(N) sync rounds for full convergence because each round advances changes by exactly one hop via store-and-forward requeuing, unlike fully-connected topologies that converge in one round.

ring-topology-needs-multiple-sync-rounds [IN] OBSERVATION

A 3-node ring topology requires at least 2 sync() rounds to achieve full convergence, unlike a fully-connected topology which converges in 1 round.

Source: entries/2026/05/29/multi-leader-replication-testmultileader.md

rotation-always-fsyncs [IN] OBSERVATION

The WAL _rotate() method unconditionally calls flush() and os.fsync() before closing the current segment, providing a durability checkpoint even in none mode

Source: entries/2026/05/29/topic-fsync-semantics-by-mode.md

rotation-preserves-fd-invariant [IN] OBSERVATION

After mayberotate returns, self.fd is guaranteed non-None and open for writing (assuming it was non-None on entry), because rotate() always opens a new file.

Source: entries/2026/05/29/write-ahead-log-wal-mayberotate.md

saturation-boundary-untested [IN] OBSERVATION

No test in testbloomfilter.py exercises the counter saturation boundary (16+ collisions at a single position) to verify deletion correctness under overflow

Source: entries/2026/05/29/topic-counter-overflow-and-cuckoo-filters.md

save-snapshot-deep-copies-state [IN] OBSERVATION

savesnapshot uses copy.deepcopy to isolate the stored snapshot from subsequent mutations to state

Source: entries/2026/05/29/topic-snapshot-storage-format.md

scalable-bloom-tightens-fpr [IN] OBSERVATION

Each new slice in ScalableBloomFilter uses FPR of p * (ratio ^ slice_index), so the aggregate false positive rate stays bounded even as the filter grows unboundedly

Source: entries/2026/05/28/topic-bloom-filter-integration.md

scan-range-corrupt-data-silent [IN] OBSERVATION

If readentry encounters truncated data mid-range, scanrangeforkey silently returns (False, None) rather than raising an error, making corruption indistinguishable from a missing key

Source: entries/2026/05/29/log-structured-merge-tree-lsm-scanrangeforkey.md

scan-range-sorted-order-assumption [IN] OBSERVATION

scanrangeforkey assumes entries between start and end are sorted by key ascending; it breaks early on k > key, so unsorted data causes silent false negatives

Source: entries/2026/05/29/log-structured-merge-tree-lsm-scanrangeforkey.md

scan-segment-ordering-invariant [IN] OBSERVATION

scansegment must be called on segments in ascending segment-ID order for the index to correctly reflect last-write-wins semantics; no runtime check enforces this ordering — it is the caller's responsibility.

Source: entries/2026/05/28/log-structured-hash-table-bitcask-scansegment.md

schema-registry-in-memory-only [IN] OBSERVATION

SchemaRegistry is a pure in-memory dict with monotonic ID assignment, with no persistence, HTTP API, or subject-based grouping

Source: entries/2026/05/29/topic-confluent-schema-registry-protocol.md

schema-registry-no-compat-enforcement [IN] OBSERVATION

Schema registration accepts any schema without checking compatibility against previously registered versions under the same subject, unlike Confluent which enforces backward/forward/full compatibility on registration

Source: entries/2026/05/29/topic-confluent-schema-registry-protocol.md

secondary-index-document-class-unused [IN] OBSERVATION

The Document dataclass exists as a domain concept but the DB classes accept pk and fields separately and never consume Document instances.

Source: entries/2026/05/29/secondary-index-partitioning-secondaryindexpartitioning.md

secondary-index-no-error-on-missing [IN] OBSERVATION

get() on a nonexistent key returns OperationResult(data=None) and querybyfield() for absent values returns empty results rather than raising exceptions; documents missing indexed fields are stored but not indexed.

Source: entries/2026/05/29/secondary-index-partitioning-testsecondaryindex_partitioning.md

secondary-index-stale-entry-cleanup [IN] OBSERVATION

Both DocumentPartitionedDB and TermPartitionedDB remove stale index entries before writing new ones during put and delete, preventing phantom results from old field values.

Source: entries/2026/05/29/secondary-index-partitioning-secondaryindexpartitioning.md

seen-set-keyed-by-ts-origin [IN] OBSERVATION

seen maps each key to a set of (timestamp, originnode_id) tuples; a change is skipped if its exact (ts, origin) pair is already in the set for that key

Source: entries/2026/05/29/topic-ring-topology-propagation.md

segment-rotation-limits-corruption-blast-radius [IN] OBSERVATION

Both Bitcask implementations rotate to new segment files at a configurable size threshold, bounding the maximum data loss from stop-at-first-error to the records trailing the corruption within a single segment

Source: entries/2026/05/29/topic-record-boundary-recovery.md

seq-num-corruption-is-silent [IN] OBSERVATION

A bit flip in a WAL record's seq_num passes CRC validation and is silently accepted with the wrong sequence number, potentially corrupting replay filtering, truncation boundaries, and recovery sequence numbering

Source: entries/2026/05/28/topic-wal-crc-coverage-gap.md

seq-num-crc-fix-is-zero-cost [IN] OBSERVATION

Adding seqnum to the CRC input requires changing two lines (encoderecord and read_record) with no measurable runtime overhead, since CRC32 over 9 extra bytes is negligible

Source: entries/2026/05/29/topic-seq-num-integrity-gap.md

serialization-has-no-overflow-check [IN] OBSERVATION

serializeleaf packs all entries into a buffer without comparing its length to page_size; overflow prevention is entirely the caller's responsibility

Source: entries/2026/05/29/topic-page-overflow-and-size-limits.md

serialize-leaf-is-pure [IN] OBSERVATION

serializeleaf is a pure function with no side effects; all I/O happens in the caller via walwrite_page

Source: entries/2026/05/28/b-tree-storage-engine-btree-serializeleaf.md

sibling-direction-matters [IN] OBSERVATION

Proof siblings are tagged "left" or "right" because SHA-256 concatenation is order-dependent — H(A||B) != H(B||A) — so swapping sibling position produces a different parent hash and verification fails.

Source: entries/2026/05/29/topic-merkle-proof-security-model.md

sibling-field-is-structural-only [IN] OBSERVATION

Leaf pages carry a nextsibling field in their serialized format (NOSIBLING = 0xFFFFFFFF sentinel), but no tree operation uses sibling pointers for traversal or merging — it is scaffolding, not a live data structure. Note: may refine existing btree-leaf-sibling-chain.

Source: entries/2026/05/28/topic-concurrent-btree-deletion.md

sibling-pointers-leaf-only [IN] OBSERVATION

Sibling pointers exist only on leaf nodes (serializeleaf at line 182); internal nodes have no right-links, making this a B+-tree, not a B-link tree

Source: entries/2026/05/29/topic-sibling-pointer-maintenance.md

sibling-preservation-on-concurrent-writes [IN] OBSERVATION

VersionedKVStore.put never drops an existing version unless the new version's clock dominates it; concurrent versions are preserved as siblings.

Source: entries/2026/05/29/vector-clocks-vector_clock.md

silent-corruption-returns-garbage [IN] OBSERVATION

A bit-flip in a value region of hash-index-storage/bitcask.py causes get() to return corrupted data with no error; the only partial integrity check is the key-equality assertion

Source: entries/2026/05/29/topic-hash-index-bitcask-no-crc.md

silent-data-loss-is-default-operational-mode [IN] OBSERVATION

Silent data loss is the default operational mode: corruption propagates through every data transformation pipeline without detection (compaction copies without CRC validation, hint generation skips integrity checks), and the testing methodology cannot observe durability failures (no crash tests, no fsync verification), meaning data degrades continuously with no feedback signal to operators.

silent-mode-drops-all-outbound [IN] OBSERVATION

SILENT Byzantine mode returns an empty list for all outbound messages, but the node still processes incoming messages and updates internal state — simulating a crash fault rather than a protocol violation

Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-applybyzantine.md

single-file-design-enables-atomic-sync [IN] OBSERVATION

PageManager stores metadata (page 0) and data pages in a single file, so one os.fsync() call in sync() covers both; splitting into separate files would require multiple fsyncs with ordering constraints in the commit fence

Source: entries/2026/05/29/topic-wal-commit-vs-metadata-sync-ordering.md

single-kv-size-validated [IN] OBSERVATION

BTree.put raises ValueError for any single key-value pair that cannot fit in a page, blocking the case where even one entry would overflow

Source: entries/2026/05/29/topic-page-overflow-and-size-limits.md

single-threaded-by-design [IN] OBSERVATION

BTree has zero synchronization primitives; the sibling chain, split logic, and page allocation are correct only because no concurrent reader or writer can observe intermediate states

Source: entries/2026/05/29/topic-sibling-pointer-maintenance.md

single-threaded-masks-immutable-gap [IN] OBSERVATION

In the synchronous single-threaded LSM execution model, the missing immutable memtable lifecycle step causes no observable data loss because _flush() completes atomically within the caller's execution flow

Source: entries/2026/05/29/topic-immutable-memtable-gap.md

size-tiered-triggers-on-sstable-count [IN] OBSERVATION

Size-tiered compaction triggers when the total SSTable count meets the min_threshold parameter

Source: entries/2026/05/29/sstable-and-compaction-testertestsstable.md

snapshot-excludes-uncommitted-writes [IN] OBSERVATION

snapshot only includes committed versions (committs <= snapshotts); all three call sites manually overlay tx.writes and tx._deletes when they need the transaction-local view

Source: entries/2026/05/29/write-skew-detection-ssidatabase-snapshot.md

snapshot-keyed-by-projection-name [IN] OBSERVATION

Each projection's snapshot is stored under self.name in store.snapshots, so two projections with the same name on the same store will overwrite each other

Source: entries/2026/05/29/topic-snapshot-storage-format.md

snapshot-loss-triggers-full-replay [IN] OBSERVATION

When event sourcing snapshots are missing, the system falls back to replaying all events from the log to reconstruct aggregate state — correct but O(n) in total event count

Source: entries/2026/05/29/topic-snapshot-persistence.md

snapshot-returns-detached-copy [IN] OBSERVATION

SSIDatabase.snapshot returns a new dict each call; callers in readpredicate and commit mutate it freely (overlaying writes, removing deletes) without corrupting the MVCC store

Source: entries/2026/05/29/write-skew-detection-ssidatabase-snapshot.md

snapshot-state-excludes-subscription [IN] OBSERVATION

Snapshot data (store.snapshots) contains only state and position; subscriber callbacks registered in EventStore._subscribers are not captured, so restoring a snapshot does not restore live-update behavior

Source: entries/2026/05/29/topic-projection-snapshot-interaction.md

snapshot-uses-next-timestamp-for-current-state [IN] OBSERVATION

During commit validation, snapshot(self.nexttimestamp) captures all committed versions because nexttimestamp is strictly greater than any existing committs

Source: entries/2026/05/29/write-skew-detection-ssidatabase-snapshot.md

snapshots-are-in-memory-only [IN] OBSERVATION

Projection snapshots are stored in store._snapshots (an in-memory dict), not persisted to disk, so they are lost on process restart

Source: entries/2026/05/29/event-sourcing-store-event_store.md

snapshots-dict-is-monkey-patched [IN] OBSERVATION

The snapshots attribute is not declared on EventStore; it is lazily created by Projection.savesnapshot via hasattr/setattr on the store instance

Source: entries/2026/05/29/topic-snapshot-storage-format.md

sort-external-merge [IN] OBSERVATION

Sort stage spills sorted chunks to temp JSONL files when the in-memory buffer exceeds memory_limit, then uses heapq.merge with a KeyedRecord wrapper (seq tiebreaker) for stable k-way merge

Source: entries/2026/05/29/batch-word-count-pipeline.md

sort-merge-join-detects-presorted-input [IN] OBSERVATION

SortMergeJoin inspects whether input is already sorted and reports this via stats["sorted_input"], allowing it to skip the sort step on pre-sorted data

Source: entries/2026/05/29/map-side-join-testmapside_joins.md

sparse-index-limits-corruption-blast-radius [IN] OBSERVATION

A corrupted data entry in an SSTable only affects lookups within one sparse index segment (between two adjacent index offsets, default 16 entries); keys indexed by other segments remain correctly accessible

Source: entries/2026/05/29/topic-sstable-footer-integrity.md

split-brain-delivers-two-hops [IN] OBSERVATION

Election messages from demoted leaders are delivered one level deep (election message + immediate response), not recursively to convergence; the next tick() finishes any remaining propagation.

Source: entries/2026/05/29/leader-election-leaderelection-resolvesplitbrain.md

split-brain-ignores-unavailable-nodes [IN] OBSERVATION

Only nodes where is_available() is True are considered when detecting and resolving split-brain; crashed nodes' stale leader state is ignored.

Source: entries/2026/05/29/leader-election-leaderelection-resolvesplitbrain.md

split-brain-not-recorded-in-history [IN] OBSERVATION

Elections triggered by resolvesplitbrain do not call recordleader, so leadership changes from split-brain resolution are absent from electionhistory.

Source: entries/2026/05/29/leader-election-leaderelection-resolvesplitbrain.md

split-brain-runs-after-message-drain [IN] OBSERVATION

resolvesplitbrain executes only after the tick's while allmessages delivery loop has fully drained, never mid-delivery.

Source: entries/2026/05/29/leader-election-leaderelection-resolvesplitbrain.md

split-chain-rewiring-order [IN] OBSERVATION

During a leaf split, the new right page is written with the old leaf's next_sibling before the old leaf is rewritten to point to the new page; WAL ordering ensures the pointer target is durable before anything references it

Source: entries/2026/05/29/topic-sibling-pointer-maintenance.md

ssi-commit-returns-dict-not-exception [IN] OBSERVATION

SSIDatabase.commit() returns a structured dict {"committed": bool, "reason": str|None, "conflicts": list} rather than raising, letting callers inspect what conflicted for retry logic

Source: entries/2026/05/29/write-skew-detection-ssi_database.md

ssi-extends-mvcc-with-dependency-tracking [IN] OBSERVATION

SSIDatabase adds a dependencygraph and predicatelocks on top of MVCC to detect read-write conflicts (write skew) that basic snapshot isolation permits

Source: entries/2026/05/29/topic-mvcc-snapshot-isolation.md

ssi-own-writes-require-buffer-check [IN] OBSERVATION

SSIDatabase.read must check tx.writes and tx.deletes before consulting _store, since a transaction's own writes aren't materialized into the shared store until commit — unlike MVCCDatabase where eager writes make own-writes visible automatically

Source: entries/2026/05/29/topic-snapshot-vs-ssi-tradeoffs.md

ssi-pessimistic-mode-aborts-at-read [IN] OBSERVATION

When pessimistic is true, checkreadconflict raises RuntimeError and sets tx._status = "aborted" at read time rather than waiting for commit — eager detection vs optimistic validation

Source: entries/2026/05/29/write-skew-detection-ssi_database.md

ssi-phantom-detection-uses-predicate-reevaluation [IN] OBSERVATION

Phantom detection re-evaluates each stored predicate function against current committed state at commit time, comparing both key sets and values to the snapshot-time result

Source: entries/2026/05/29/write-skew-detection-ssi_database.md

ssi-read-only-skip-validation [IN] OBSERVATION

Read-only transactions (empty write set) bypass all conflict detection in commit() and always commit successfully — they cannot cause write skew because they don't write

Source: entries/2026/05/29/write-skew-detection-ssi_database.md

ssi-serializability-through-layered-invariants [IN] OBSERVATION

SSI's commit path relies on three mutually reinforcing properties: read-only transactions skip validation (safe because the store contains only committed data from prior commits), and within a single transaction, writes and deletes for the same key are mutually exclusive, preventing conflicting modifications to that key within the transaction's own write set.

ssi-store-contains-only-committed-data [IN] OBSERVATION

SSIDatabase.store only contains versions stamped with commit timestamps; uncommitted writes live in tx.writes until commit, so every version in the store is guaranteed to be from a committed transaction

Source: entries/2026/05/29/topic-snapshot-vs-ssi-tradeoffs.md

ssi-visibility-is-timestamp-comparison [IN] OBSERVATION

SSIDatabase.visiblevalue determines visibility by a single comparison (committs <= snapshotts) with no active-set tracking, because the deferred-write model guarantees all stored versions are committed

Source: entries/2026/05/29/topic-snapshot-vs-ssi-tradeoffs.md

ssi-writes-deletes-mutually-exclusive [IN] OBSERVATION

write() discards any pending delete() for the same key and vice versa; a key is in tx.writes XOR tx.deletes, never both simultaneously

Source: entries/2026/05/29/write-skew-detection-ssi_database.md

sstable-block-size-is-index-interval [IN] OBSERVATION

SSTableWriter's block_size parameter controls the sparse index frequency (one index entry per N records), not a physical byte-aligned block size; records are written contiguously with no padding or alignment to fixed byte boundaries.

Source: entries/2026/05/29/topic-block-based-wal-format.md

sstable-compaction-manager-copies-list [IN] OBSERVATION

CompactionManager.getsstables() returns list(self.sstables) (a shallow copy), preventing iterator invalidation from concurrent mutation but not preventing file-deletion races on the underlying SSTable files

Source: entries/2026/05/29/topic-superversion-refcount-implementation.md

sstable-compaction-manager-lacks-bottommost-detection [IN] OBSERVATION

The CompactionManager tracks maxlevels (default 7) but has no logic to check whether lower levels contain overlapping key ranges before deciding tombstone removal, unlike LevelDB/RocksDB which use VersionSet metadata for this

Source: entries/2026/05/29/topic-leveled-compaction-tombstone-policy.md

sstable-crash-leaves-zero-count-header [IN] OBSERVATION

A crash between SSTableWriter creation and finish() leaves entry_count=0 in the file header while data records exist on disk; SSTableReader trusts the header count and reads the file as empty, silently losing all written entries.

Source: entries/2026/05/29/topic-crash-recovery-invariants.md

sstable-entry-types-binary [IN] OBSERVATION

The SSTable format supports exactly two entry types (value and tombstone via TOMBSTONE_MARKER); there is no third type for merge operands, which would be required for RocksDB-style merge support

Source: entries/2026/05/29/topic-rocksdb-merge-operator.md

sstable-format-lacks-integrity-and-efficiency-mechanisms [IN] OBSERVATION

The SSTable format has no integrity or efficiency safeguards: entries carry no CRC or checksum, range scans bypass the sparse index and scan from file start, and the entry count header is trusted without verification — meaning corruption goes undetected, negative lookups are unnecessarily expensive, and truncated files produce silent data loss.

sstable-get-miss-returns-none [IN] OBSERVATION

SSTableReader.get() returns None for keys not found in the table, rather than raising an exception

Source: entries/2026/05/29/sstable-and-compaction-testertestsstable.md

sstable-get-rebuilds-key-list-every-call [IN] OBSERVATION

SSTable.get extracts sparse index keys into a fresh list on every call rather than caching them, making each lookup O(m) in index size before the binary search begins

Source: entries/2026/05/29/log-structured-merge-tree-lsm-get.md

sstable-get-requires-prior-load-index [IN] OBSERVATION

For SSTables loaded from disk, loadindex() must be called before get() or every lookup silently returns (False, None) — the method does not call loadindex() itself

Source: entries/2026/05/29/log-structured-merge-tree-lsm-get.md

sstable-has-magic-and-version [IN] OBSERVATION

The SSTable format (sstable.py:10-11) uses a 4-byte magic b"SSTB" and a 2-byte version field, making it the only binary format in the repo with format versioning

Source: entries/2026/05/29/topic-wal-record-format-evolution.md

sstable-has-timestamps-lsm-ignores-them [IN] OBSERVATION

SSTableEntry in sstable.py carries a timestamp field, but lsm.py's merge and conflict resolution uses only structural SSTable ordering (sequence number priority), not per-entry timestamps

Source: entries/2026/05/29/topic-leveldb-merging-iterator.md

sstable-layer-compounds-integrity-and-performance-deficiencies [IN] OBSERVATION

The SSTable layer has mutually reinforcing integrity and performance deficiencies: no per-entry CRC or checksum means corrupted entries return wrong values silently, and no Bloom filter integration forces every negative lookup to scan all SSTables sequentially, maximizing both the probability and frequency of encountering those silent corruptions.

sstable-lookup-is-two-phase [IN] OBSERVATION

Point lookups in both SSTable implementations first binary-search the sparse index to identify a block, then linearly scan entries within that block's byte range; total cost is O(log(N/B) + B)

Source: entries/2026/05/29/topic-sstable-block-format.md

sstable-magic-is-only-integrity-check [IN] OBSERVATION

The 4-byte magic b"SSTB" at offset 0 is the only data integrity check in the sstable-and-compaction SSTable read path; it gates file-type validation but provides no corruption detection for keys, values, timestamps, or structural offsets

Source: entries/2026/05/29/topic-sstable-magic-number-vs-crc.md

sstable-merge-accepts-reader-objects [IN] OBSERVATION

merge_sstables() takes SSTableReader instances directly as inputs, using a streaming iterator interface rather than loading entire files into memory

Source: entries/2026/05/29/sstable-and-compaction-testertestsstable.md

sstable-merge-dedup-newest-timestamp [IN] OBSERVATION

When multiple SSTables contain the same key, merge_sstables keeps only the entry with the highest timestamp via heap ordering (key, -timestamp), silently discarding all older versions.

Source: entries/2026/05/28/sstable-and-compaction-sstable-merge_sstables.md

sstable-merge-forward-only [IN] OBSERVATION

The k-way merge in sstable-and-compaction/sstable.py is forward-only with no direction tracking or reverse iteration support; there is no direction enum, no FindLargest, and no heap rebuild on direction change.

Source: entries/2026/05/29/topic-leveldb-merger.md

sstable-metadata-has-level-field [IN] OBSERVATION

SSTableMetadata in sstable-and-compaction/sstable.py tracks a level field, indicating the compaction module is structured for leveled compaction even though lsm.py uses flat single-level merges

Source: entries/2026/05/29/topic-mvcc-version-snapshots.md

sstable-metadata-level-field-unused [IN] OBSERVATION

SSTableMetadata.level in sstable-and-compaction/sstable.py exists as a field but has no persistent backing; level assignments would be lost on restart without a MANIFEST equivalent

Source: entries/2026/05/28/topic-manifest-based-compaction.md

sstable-no-checksums [IN] OBSERVATION

The SSTable format has no CRC or checksum on entries or blocks; the only validation is a magic-byte assertion (MAGIC == b"SSTB") on file open.

Source: entries/2026/05/28/sstable-and-compaction-sstable.md

sstable-range-scan-end-exclusive [IN] OBSERVATION

range_scan(start, end) is inclusive on start and exclusive on end.

Source: entries/2026/05/28/sstable-and-compaction-test_sstable.md

sstable-range-scan-no-index-skip [IN] OBSERVATION

SSTableReader.range_scan(start, end) scans from the beginning of the file rather than using the sparse index to skip ahead, making it O(N) regardless of range position.

Source: entries/2026/05/28/sstable-and-compaction-sstable.md

sstable-scan-is-streaming [IN] OBSERVATION

SSTable.scan() and scan_all() use Python generators (yield), meaning individual SSTables already support streaming — only the cross-SSTable merge layer materializes

Source: entries/2026/05/28/topic-merge-iterator-vs-dict-merge.md

sstable-short-read-guards-prevent-overrun [IN] OBSERVATION

readentry() in the LSM SSTable checks for short reads on key and value fields, preventing buffer overruns on truncated files but silently skipping all remaining entries in the scan range

Source: entries/2026/05/29/topic-sstable-footer-integrity.md

sstable-sorted-order-caller-responsibility [IN] OBSERVATION

SSTableWriter.add() does not enforce sorted key order; violation silently corrupts binary search in SSTableReader.get().

Source: entries/2026/05/28/sstable-and-compaction-sstable.md

sstable-sparse-index-bounds-corruption [IN] OBSERVATION

A corrupted SSTable data entry only affects lookups within one sparse index segment; the sparse index provides independent entry points into different file regions, bounding corruption blast radius to ~block_size entries

Source: entries/2026/05/29/topic-length-prefix-framing-resilience.md

sstable-sparse-index-every-nth [IN] OBSERVATION

SSTableReader loads a sparse index into memory that records every block_size-th key and byte offset; point lookups binary-search the index then linearly scan within the block.

Source: entries/2026/05/28/sstable-and-compaction-sstable.md

sstable-sparse-index-in-footer [IN] OBSERVATION

Both SSTable implementations store the sparse index at the end of the file with a footer pointer, meaning bloom filters could be co-located in the same footer region without changing the file layout

Source: entries/2026/05/29/topic-bloom-filter-optimization.md

sstable-test-is-script-not-framework [IN] OBSERVATION

test_sstable.py is a linear script using bare assert statements and print progress markers, not a pytest/unittest suite — it runs top-to-bottom in a tempfile.TemporaryDirectory context.

Source: entries/2026/05/28/sstable-and-compaction-test_sstable.md

sstable-tombstone-encoding-0xff [IN] OBSERVATION

Tombstones are encoded on disk as a single 0xFF byte in the value position; this is unambiguous because valid value-length fields are 4 bytes big-endian and cannot start with 0xFF at realistic value sizes.

Source: entries/2026/05/28/sstable-and-compaction-sstable.md

sstable-tombstone-is-null-value [IN] OBSERVATION

Tombstones are represented at the API level as SSTableEntry objects with value=None, not as a separate deletion record type.

Source: entries/2026/05/28/sstable-and-compaction-test_sstable.md

sstable-trailer-single-point-of-failure [IN] OBSERVATION

The SSTable's 12-byte trailer (footer_start offset + entry count) is an unprotected single point of failure; its corruption makes the entire file's sparse index and data entries unreachable with no fallback discovery mechanism

Source: entries/2026/05/29/topic-sstable-footer-integrity.md

sstable-write-path-has-dual-fragility [IN] OBSERVATION

The SSTable write path has two independent fragility points: sorted key order is the caller's responsibility with no enforcement (violation silently corrupts binary search), and the file header is written as a placeholder then patched via seek-back after all entries are written (a crash between data write and header patch leaves a structurally invalid file).

sstable-writer-append-then-patch [IN] OBSERVATION

SSTableWriter writes a placeholder header, appends all entries and the sparse index, then seeks back to byte 0 to patch the final entry count — an unfinished SSTable has entry_count=0 and appears empty.

Source: entries/2026/05/28/sstable-and-compaction-sstable.md

sstable-writer-finish-no-fsync [IN] OBSERVATION

sstable-and-compaction/sstable.py:SSTableWriter.finish() closes the file without calling flush() or os.fsync(), relying on implicit Python close-time buffer flush with no disk durability guarantee.

Source: entries/2026/05/29/topic-fsync-guarantees-across-implementations.md

sstable-writer-is-natural-filter-builder [IN] OBSERVATION

The SSTableWriter.add() → finish() lifecycle is the natural integration point for per-SSTable bloom filter construction, since keys are already iterated in sorted order during the write pass

Source: entries/2026/05/28/topic-counting-bloom-compaction-interaction.md

sstable-writer-no-fsync-on-close [IN] OBSERVATION

SSTableWriter.finish() at sstable-and-compaction/sstable.py:91 calls close() without prior fsync(), leaving newly flushed SSTables vulnerable to loss if the OS page cache hasn't been written to disk

Source: entries/2026/05/29/topic-bitcask-crash-recovery-guarantees.md

sstable-writer-single-file-lifecycle [IN] OBSERVATION

SSTableWriter binds to one file descriptor at construction with no method to rotate to a new file mid-write, making it structurally incompatible with size-bounded output splitting

Source: entries/2026/05/29/topic-compaction-output-splitting.md

stage-timing-subtracts-downstream [IN] OBSERVATION

Stage.trackedprocess subtracts time spent after each yield (downstream processing time) to attribute wall-clock time accurately to each individual stage

Source: entries/2026/05/29/batch-word-count-pipeline.md

standard-bloom-suffices-for-immutable-sstables [IN] OBSERVATION

Standard bloom filters lack deletion support, but SSTables are write-once-then-discarded, so deletion is never needed and the 8x space savings over counting filters is free

Source: entries/2026/05/29/topic-bloom-filter-on-sstables.md

stcs-compact-mutates-sstable-list [IN] OBSERVATION

stcscompact both removes old and appends new entries to self._sstables as a side effect; callers must not hold references into the list across a compaction call.

Source: entries/2026/05/29/sstable-and-compaction-sstable-stcscompact.md

stcs-merges-all-qualifying-buckets-per-call [IN] OBSERVATION

Size-tiered runcompaction() processes every bucket that meets minthreshold in a single call, while leveled compaction processes at most one level per call.

Source: entries/2026/05/29/sstable-and-compaction-sstable-CompactionManager.md

stcs-merges-within-size-tiers [IN] OBSERVATION

Size-tiered compaction only merges SSTables in the same size tier (determined by gettier() comparing filesize against size_thresholds); it never merges a small SSTable directly with a much larger one.

Source: entries/2026/05/29/sstable-and-compaction-sstable-stcscompact.md

stcs-min-threshold-default-four [IN] OBSERVATION

A size tier must accumulate at least 4 SSTables (the default minthreshold) before stcscompact merges them; buckets below this threshold are skipped entirely.

Source: entries/2026/05/29/sstable-and-compaction-sstable-stcscompact.md

stcs-overlapping-key-ranges [IN] OBSERVATION

Size-tiered compaction allows multiple SSTables at the same tier to contain overlapping key ranges, requiring point lookups to check all SSTables in a tier

Source: entries/2026/05/29/topic-leveled-vs-size-tiered-compaction.md

storage-crash-recovery-has-no-safe-path [IN] OBSERVATION

No storage engine has a fully safe crash recovery path: compaction lacks atomicity, WAL replay ignores batch boundaries, and CRC checksums leave routing metadata unprotected — corruption can enter via unprotected metadata, persist through non-validating compaction, and survive recovery via batch-unaware replay.

storage-has-no-self-healing-at-any-layer [IN] OBSERVATION

Storage engines degrade monotonically during normal operation (leaked pages, growing height, no rebalancing) and have no safe recovery after crashes (non-atomic compaction, partial batch replay, unchecked metadata) — there is no path from degraded state back to healthy state.

storage-operations-have-unbounded-memory-consumption [IN] OBSERVATION

Three core storage operations materialize entire datasets in memory with no streaming, pagination, or backpressure: LSM range scans load all matching entries into a dict before returning, compaction buffers all merged entries into a list before writing, and SSTable lookups rebuild the sparse index key list on every call — creating O(n) memory pressure proportional to total data size rather than result size.

store-range-scan-passes-global-bounds [IN] OBSERVATION

RangePartitionedStore.rangescan passes the same global startkey/endkey to every overlapping partition rather than clamping to partition boundaries, relying on bisectleft to naturally exclude out-of-range keys.

Source: entries/2026/05/29/range-partitioning-rangepartitioning-rangescan.md

strategy-enum-covers-two-of-three [IN] OBSERVATION

ConflictStrategy implements LWW and custom merge but not CRDTs; the CRDT module exists separately in crdts.py with no integration into the multi-leader replication strategy dispatch

Source: entries/2026/05/29/topic-conflict-resolution-strategies.md

strategy-selection-is-binary [IN] OBSERVATION

CompactionManager compares strategy with == against "size_tiered" — any other string silently selects leveled compaction with no validation or error.

Source: entries/2026/05/29/sstable-and-compaction-sstable-CompactionManager.md

stream-join-bounds-materialization-via-watermarks [IN] OBSERVATION

StreamJoinProcessor expires buffered events when they fall below watermark - window_duration, bounding memory at the cost of potentially dropping late-arriving matches

Source: entries/2026/05/29/topic-ddia-chapter-10-batch-processing.md

stream-join-buffers-bounded-by-window [IN] OBSERVATION

The processor actively expires events so buffer sizes remain proportional to windowduration × eventrate, not total events processed; tests confirm buffers stay under 10 per side after 1000 events with a 5-second window.

Source: entries/2026/05/29/stream-join-processor-teststreamjoin_processor.md

stream-join-correctness-depends-on-window-alignment [IN] OBSERVATION

Stream join correctness depends on consistent alignment across three independent window mechanisms: join and aggregation windows are configured independently with no constraint preventing misalignment, expiration uses a one-sided cutoff that doesn't match the symmetric matching predicate, and buffer size is bounded only by the window duration relative to watermark advancement rate.

stream-join-get-results-destructive [IN] OBSERVATION

get_results() swaps out and resets the accumulated results list; callers see only results since the previous drain (pull-based consumption model)

Source: entries/2026/05/29/stream-join-processor-streamjoinprocessor.md

stream-join-inner-requires-key-and-window-match [IN] OBSERVATION

Inner join only produces a JoinResult when both key equality and |tleft - tright| <= window_duration hold; either condition failing alone prevents a match.

Source: entries/2026/05/29/stream-join-processor-teststreamjoin_processor.md

stream-join-late-events-dropped-silently [IN] OBSERVATION

Events arriving after watermark - allowedlateness are dropped and counted in stats.lateevents_dropped; the processor never raises exceptions for late arrivals.

Source: entries/2026/05/29/stream-join-processor-teststreamjoin_processor.md

stream-join-left-asymmetric [IN] OBSERVATION

JoinType.LEFT emits misses only for unmatched left-side events; unmatched right-side events are silently dropped without any miss emission

Source: entries/2026/05/29/stream-join-processor-streamjoinprocessor.md

stream-join-miss-deferred [IN] OBSERVATION

Outer-join misses are only emitted at event expiration time (when timestamp falls below watermark minus window duration), never eagerly on arrival

Source: entries/2026/05/29/stream-join-processor-streamjoinprocessor.md

stream-join-no-stream-validation [IN] OBSERVATION

The processor does not validate that an event's streamname matches either configured stream; an unknown stream name is silently treated as the right stream via is_left returning False

Source: entries/2026/05/29/stream-join-processor-streamjoinprocessor.md

stream-join-one-to-many [IN] OBSERVATION

A single event can match multiple events on the opposite side; the matched flag is set on all participants, so none produce outer-join misses

Source: entries/2026/05/29/stream-join-processor-streamjoinprocessor.md

stream-join-process-event-returns-immediate-matches [IN] OBSERVATION

process_event() returns matches synchronously on each call (push-based) rather than batching to a flush boundary; results are available the instant a match is found.

Source: entries/2026/05/29/stream-join-processor-teststreamjoin_processor.md

stream-join-watermark-monotonic [IN] OBSERVATION

The join processor's watermark only advances forward via max(current, event.timestamp); advance_time() silently ignores timestamps at or below the current watermark

Source: entries/2026/05/29/stream-join-processor-streamjoinprocessor.md

stream-version-is-event-count [IN] OBSERVATION

streamversion() returns the count of events in a stream (not the latest eventid), and expected_version checks against this count for optimistic concurrency

Source: entries/2026/05/29/event-sourcing-store-event_store.md

subscriber-dispatch-is-synchronous [IN] OBSERVATION

Event subscriber notification in EventStore blocks the append() caller — projections update on the writer's thread before the append call returns, trading throughput for zero-gap consistency

Source: entries/2026/05/29/topic-live-projection-subscription-mechanism.md

sync-all-two-rounds-suffices [IN] OBSERVATION

CRDTReplicaGroup.sync_all runs exactly 2 rounds of all-pairs sync, which is sufficient for convergence of any state-based CRDT with a commutative/associative/idempotent merge

Source: entries/2026/05/29/conflict-free-replicated-data-types-crdts.md

sync-mode-none-skips-per-write-fsync [IN] OBSERVATION

When syncmode="none", the per-write do_sync() call (default force=False) is a complete no-op — no fsync and no batch queue — trading durability for maximum write throughput

Source: entries/2026/05/29/topic-sync-mode-none-safety.md

system-converges-on-permanent-dark-failure [IN] OBSERVATION

The system converges on permanent dark failure: there is no operational path to correctness at any scale AND failure is irreversible and undetectable, meaning the system silently accumulates data loss with no diagnostic signal, operational remedy, or evolutionary escape — even the decision to rebuild requires evidence the system cannot produce.

system-degrades-monotonically-at-every-abstraction-level [IN] OBSERVATION

The system degrades monotonically at every abstraction level with no equilibrium: storage engines erode structurally with no self-healing (leaked pages, growing height, no rebalancing, unsafe recovery), and the distributed layer's compensating mechanisms (anti-entropy, read repair) cannot fully resolve the resulting replica divergence, which accumulates without bound.

system-failure-is-undetectable-and-irreversible [IN] OBSERVATION

The system's natural trajectory is toward permanent undetectable data loss: corruption propagates silently through every pipeline without detection, AND the structural degradation that enables it is irreversible at every abstraction level — failure is invisible while it happens and unfixable after it's discovered.

system-has-no-stable-operational-regime [IN] OBSERVATION

The system has no stable operational regime at any scale: storage degrades monotonically during normal operation with no self-healing, failures widen correctness gaps at every abstraction level, and no storage paradigm can escape by scaling (hash indexes hit memory walls, LSM hits probe walls) — the architecture converges toward failure under every operating condition.

system-has-no-undo-at-any-layer [IN] OBSERVATION

No layer in the system can reverse a completed operation: the WAL stores only new values with no before-images (making rollback structurally impossible), and transaction abort is a status-flag change that leaves all written versions on disk, meaning neither the storage layer nor the logical transaction layer supports undo.

system-neither-verifiable-nor-repairable [IN] OBSERVATION

The system is trapped in a verification-repair deadlock: verification is impossible at every architectural layer (integrity checks degrade along the pipeline, protocol safety is unfalsifiable under current testing) AND the rigid binary format design across the storage stack prevents adding verification or self-healing capabilities through evolution, foreclosing both detection and correction of faults.

term-partitioned-point-query-is-targeted [IN] OBSERVATION

TermPartitionedDB.querybyfield looks up a single index partition by term hash then fetches documents from their home partitions, touching 1 + K partitions (K = distinct data partitions) instead of scattering to all.

Source: entries/2026/05/29/secondary-index-partitioning-secondaryindexpartitioning.md

term-partitioned-write-touches-multiple [IN] OBSERVATION

TermPartitionedDB.put touches 1 + N partitions where N depends on how many distinct index partitions the document's field values hash to, because index entries live on term-hashed partitions separate from the data partition.

Source: entries/2026/05/29/secondary-index-partitioning-secondaryindexpartitioning.md

test-block-size-differs-from-default [IN] OBSERVATION

Tests universally use block_size=4 while SSTableWriter defaults to 64 and lsm.py defaults to 16; the small test value forces multi-block index paths with small datasets

Source: entries/2026/05/29/topic-sstable-block-format.md

tester-dev-suites-overlap-not-hierarchical [IN] OBSERVATION

The tester and developer test suites have overlapping but non-hierarchical coverage: tester files uniquely test spec-example compliance and cross-path equivalence, while dev files uniquely test crash recovery and internal state

Source: entries/2026/05/29/topic-test-coverage-gaps.md

tester-files-are-generated-artifacts [IN] OBSERVATION

The testertest*.py files are generated by the code-expert workflow's tester stage from implementation specs, distinct from hand-written test_*.py pytest files

Source: entries/2026/05/29/topic-tester-runner-system.md

tester-files-are-hand-written [IN] OBSERVATION

The testertest*.py files contain no auto-generation markers and are hand-authored standalone test suites, not generated by code-expert or any other pipeline — contradicts existing tester-files-are-generated-artifacts

Source: entries/2026/05/29/topic-code-expert-generation-pipeline.md

tester-files-are-standalone-runnable [IN] OBSERVATION

Every testertest*.py file contains an if _name == "main_": block, making it executable via python without pytest

Source: entries/2026/05/29/topic-tester-vs-test-convention.md

tester-files-never-import-internals [IN] OBSERVATION

Tester test files import only the public API (e.g., from btree import BTree), while developer test files import internal types like WAL, serializeleaf, and HEADER_FMT for state injection and internal invariant checking

Source: entries/2026/05/29/topic-test-coverage-gaps.md

tester-files-use-stdout-validation [IN] OBSERVATION

Tester files print "test_name PASSED" to stdout, indicating an output-parsing runner rather than pytest's exit-code-based reporting

Source: entries/2026/05/29/topic-tester-vs-test-convention.md

tester-naming-avoids-pytest-discovery [IN] OBSERVATION

The testertest*.py prefix does not match pytest's default collection patterns (test*.py / *test.py), so these files are excluded from pytest runs unless explicitly specified

Source: entries/2026/05/29/topic-tester-vs-test-convention.md

tester-print-includes-diagnostics [IN] OBSERVATION

Tester files embed runtime metrics in their pass/fail output (tree height, page counts, pass/fail ratios) that pytest's structured reporting does not surface without the -s flag

Source: entries/2026/05/29/topic-tester-vs-pytest-test-duality.md

tester-pytest-boundary-is-leaky [IN] OBSERVATION

At least 6 testertest*.py files import pytest despite being designed for standalone execution, indicating the two suites share lineage rather than being fully independent

Source: entries/2026/05/29/topic-tester-vs-test-convention.md

tester-spec-compliance-is-unique [IN] OBSERVATION

TestSpecExample tests in tester files replay the exact spec usage example verbatim as a living executable spec, a form of spec-drift detection that developer test suites do not replicate

Source: entries/2026/05/29/topic-test-coverage-gaps.md

tester-test-decomposition [IN] OBSERVATION

Tester files decompose monolithic pytest tests into smaller focused functions with docstrings (e.g., testbasic becomes testbasicputget + testrangescan + test_persistence)

Source: entries/2026/05/29/topic-code-expert-generation-pipeline.md

tester-test-dual-runner [IN] OBSERVATION

Some testertest*.py files are pytest-only (using fixtures/pytest.raises), while others include an if _name == 'main_': block for standalone execution; the two execution patterns coexist across modules

Source: entries/2026/05/29/topic-tester-test-file-locations.md

tester-test-files-not-excluded [IN] OBSERVATION

No collectignore, norecursedirs, testpaths, or pythonfiles directive exists anywhere in the repo, so testertest*.py files will be collected by pytest's default *_test.py glob

Source: entries/2026/05/29/topic-pytest-configuration.md

tester-test-no-shared-harness [IN] OBSERVATION

There is no shared conftest.py, base class, or test harness across testertest*.py files; each imports only its own module under test plus standard library utilities and is fully self-contained

Source: entries/2026/05/29/topic-tester-test-file-locations.md

tester-tests-are-distilled-specs [IN] OBSERVATION

Each testertest*.py is a simplified subset of its corresponding test_*.py, organized by numbered behavioral properties (# 1. No false negatives, # 2. FPR within 2x of target) rather than exhaustive edge cases

Source: entries/2026/05/29/topic-tester-generation-pipeline.md

tester-tests-not-auto-regenerated [IN] OBSERVATION

testertest*.py files contain no auto-generation markers (generated, DO NOT EDIT) and no regeneration tooling exists in the documented code-expert workflow; they are static committed files

Source: entries/2026/05/29/topic-tester-generation-pipeline.md

tester-tests-public-contract [IN] OBSERVATION

testertest*.py files test the public API contract of each implementation, importing only public symbols (plus format constants for fault injection), acting as external conformance harnesses

Source: entries/2026/05/29/log-structured-hash-table-testertestbitcask.md

tester-versions-have-more-test-functions [IN] OBSERVATION

The testertest*.py files consistently contain more test functions than their test_*.py counterparts (bloom-filter: 11 vs 5, gossip-protocol: 12 vs 10), and include tests for features not covered in the default-discovered files

Source: entries/2026/05/29/topic-pytest-default-collection.md

testing-covers-neither-crash-nor-async-failure-modes [IN] OBSERVATION

The testing methodology covers neither single-node crash failures nor distributed asynchronous failures: crash recovery paths are systematically untested (no torn-write tests, no compaction crash tests), and distributed protocols are validated only under synchronous deterministic simulation with no real network I/O — the entire failure surface area from storage crashes to network partitions is invisible to the test suite.

tob-competing-proposals-bump-rounds [IN] OBSERVATION

When a TOB slot is decided by another proposer with a different value, the losing node re-proposes its value for a new slot rather than retrying the same slot with a higher round

Source: entries/2026/05/29/topic-multi-paxos-vs-single-decree.md

tob-consensus-uses-paxos-ballot-numbers [IN] OBSERVATION

ConsensusInstance.prepare(n) / accept(n, val) use monotonic proposal numbers where a higher prepare preempts a lower one, causing accept with a stale number to return {accepted: False}

Source: entries/2026/05/29/total-order-broadcast-testtotalorder_broadcast.md

tob-contiguous-delivery [IN] OBSERVATION

Messages are delivered to the application only when all prior slots are decided; nextslot advances through a contiguous run, and a gap (undecided slot N when N+1 is decided) blocks all later delivery

Source: entries/2026/05/29/total-order-broadcast-totalorderbroadcast.md

tob-full-paxos-per-slot [IN] OBSERVATION

Every slot in Total Order Broadcast runs the complete two-phase Paxos protocol (Prepare→Promise→Accept→Accepted) from scratch; no state carries between slots to skip Phase 1

Source: entries/2026/05/29/topic-multi-paxos-vs-single-decree.md

tob-linearizable-reads-via-broadcast [IN] OBSERVATION

LinearizableRegister.read() broadcasts the read through consensus rather than reading locally, establishing a consistent point in the total order — necessary for linearizability without leader leases

Source: entries/2026/05/29/total-order-broadcast-totalorderbroadcast.md

tob-linearizable-register-is-built-on-broadcast [IN] OBSERVATION

LinearizableRegister wraps TOBCluster to provide write()/read()/compareandset(), demonstrating the DDIA Chapter 9 equivalence between total-order broadcast and linearizable storage

Source: entries/2026/05/29/total-order-broadcast-testtotalorder_broadcast.md

tob-no-leader-concept [IN] OBSERVATION

TOBNode has no leader election or lease mechanism; any node can propose for any undecided slot at any time, making it vulnerable to competing proposals

Source: entries/2026/05/29/topic-multi-paxos-vs-single-decree.md

tob-no-persistent-storage [IN] OBSERVATION

TOBNode crash/recovery uses state transfer from peers via force_decide() on each missed slot; there is no WAL, persistent log, or on-disk state — all durability depends on a majority staying alive

Source: entries/2026/05/29/total-order-broadcast-totalorderbroadcast.md

tob-paxos-per-slot [IN] OBSERVATION

Each slot in the total order log is decided by an independent single-decree Paxos instance (ConsensusInstance); there is no stable-leader optimization or Multi-Paxos Phase 1 skip

Source: entries/2026/05/29/total-order-broadcast-totalorderbroadcast.md

tob-preempted-value-requeued [IN] OBSERVATION

When a proposer's value loses its slot to a competing value, the original is pushed back onto _pending for proposal in a later slot, ensuring no broadcast messages are silently dropped

Source: entries/2026/05/29/total-order-broadcast-totalorderbroadcast.md

tob-proposal-number-encodes-node-id [IN] OBSERVATION

Proposal numbers are computed as round * numnodes + nodeid, guaranteeing uniqueness across nodes within the same round but restarting at round 0 for each slot independently

Source: entries/2026/05/29/topic-multi-paxos-vs-single-decree.md

tob-quorum-is-strict-majority [IN] OBSERVATION

The TOB cluster uses 3 nodes where 2 can make progress but 1 alone cannot, confirming a strict-majority quorum requirement for both Paxos Phase 1 and Phase 2

Source: entries/2026/05/29/total-order-broadcast-testtotalorder_broadcast.md

tob-recovery-replays-missed-slots [IN] OBSERVATION

After recover_node(), the recovered node's delivery order matches live nodes' order, meaning recovery replays all consensus decisions made while the node was down via state transfer

Source: entries/2026/05/29/total-order-broadcast-testtotalorder_broadcast.md

tob-uses-single-decree-paxos-per-slot [IN] OBSERVATION

Each slot in the total order broadcast is decided by an independent ConsensusInstance running single-decree Paxos with prepare/accept phases (totalorderbroadcast.py:3).

Source: entries/2026/05/29/topic-linearizable-reads-via-tob.md

tob-value-adoption-on-prepare [IN] OBSERVATION

In handleprepare_response, if any acceptor reports a previously accepted value, the proposer must adopt the one with the highest proposal number — this is the core Paxos safety invariant preventing decided values from being overwritten

Source: entries/2026/05/29/total-order-broadcast-totalorderbroadcast.md

tombstone-lifecycle-fragmented-across-modules [IN] OBSERVATION

Tombstone management is handled differently at every layer: the LSM uses an ambiguous empty-bytes sentinel indistinguishable from empty values, compaction preserves tombstones by default with caller-controlled removal, and distributed deletion requires cross-replica convergence before safe removal — with no coordination between these concerns.

tombstone-marker-single-byte-no-redundancy [IN] OBSERVATION

The tombstone/value distinction in sstable-and-compaction/sstable.py rests on a single byte (0xFF vs first byte of a 4-byte value length), making it vulnerable to single-bit corruption that silently converts live entries to deletions or vice versa

Source: entries/2026/05/29/topic-sstable-magic-number-vs-crc.md

tombstone-removal-requires-full-key-coverage [IN] OBSERVATION

A tombstone for key K can only be safely removed during compaction if every SSTable that could contain an older entry for K is included in that compaction run; otherwise the deleted key can be resurrected from a surviving SSTable

Source: entries/2026/05/28/topic-tombstone-gc-safety.md

tombstone-removes-cross-segment [IN] OBSERVATION

A tombstone record in segment N removes the key from _index even if the key's live value was written in an earlier segment, because the index is a flat dict shared across all segment scans.

Source: entries/2026/05/28/log-structured-hash-table-bitcask-scansegment.md

tombstone-reported-as-none-in-conflict [IN] OBSERVATION

ConflictRecord.remotevalue reports None for tombstoned changes rather than the internal TOMBSTONE sentinel; callers see deletion as None and never observe the sentinel object.

Source: entries/2026/05/29/multi-leader-replication-multileader-applyremote_change.md

tombstone-sentinel-is-a-forbidden-value [IN] OBSERVATION

Storing b"_BITCASKTOMBSTONE_" as a legitimate value in log-structured-hash-table causes silent data loss on next recovery, as scan_segment interprets it as a delete — the sentinel is an implicit API constraint with no validation

Source: entries/2026/05/29/topic-tombstone-encoding.md

topology-creates-divergence-window-not-correctness-gap [IN] OBSERVATION

Replication topology affects only the duration of observable divergence, not the final convergence outcome: all topologies use identical deterministic LWW resolution, but ring topology creates O(N) rounds of divergence while all-to-all converges in O(1) — with no adaptive topology switching, the system cannot minimize the divergence window in response to cluster size or partition conditions.

topology-does-not-change-conflict-outcome [IN] OBSERVATION

Both topologies use the same deterministic (timestamp, node_id) comparison for LWW resolution; topology affects when and where conflicts are detected, not which value wins.

Source: entries/2026/05/29/topic-ring-vs-all-to-all-propagation-delay.md

torn-length-prefix-causes-silent-skip [IN] OBSERVATION

If a torn write corrupts the 4-byte length prefix in wal.py:readrecord, the reader interprets garbage as record_length, reads that many bytes (consuming valid subsequent records as data), then returns None on short read — no error is raised and no resync is attempted

Source: entries/2026/05/29/topic-block-aligned-wal-records.md

transaction-isolation-composes-two-invariant-layers [IN] OBSERVATION

Serializable snapshot isolation is achieved by composing two complementary invariant layers: MVCC provides visibility correctness (append-only versions, own-writes guarantee, symmetric deletion visibility) while SSI adds serializability enforcement (read-only optimization, committed-only snapshots, write-delete mutual exclusion).

transaction-isolation-fragile-under-restart [IN] OBSERVATION

Transaction isolation's carefully composed two-invariant-layer model (MVCC visibility plus SSI conflict detection) is fragile under restart: abort is a status-change-only operation that leaves written data on disk, and MVCC counters that must be monotonic have no persistence mechanism, so a crash both resurrects aborted writes and breaks the ordering invariant that determines visibility.

transaction-recovery-has-no-crash-path [IN] OBSERVATION

Transaction abort is a status-change-only operation with no disk rollback, and MVCC correctness depends on monotonic counters that have no persistence mechanism, meaning a crash simultaneously leaves aborted-transaction writes on disk and resets the ordering counters that determine visibility — both invariants fail together.

transfer-dict-keyed-by-arc [IN] OBSERVATION

The transfer dict uses (arcstart, arcend) tuples as keys, so two vnodes producing identical arc boundaries would silently overwrite rather than accumulate.

Source: entries/2026/05/29/topic-transfer-map-accuracy.md

transfer-test-weak-assertions [IN] OBSERVATION

testaddnodereturnstransfers only asserts transfer direction (A→B) and existence, not arc non-overlap or total size correctness — the code is correct but the test doesn't prove it.

Source: entries/2026/05/29/topic-transfer-map-accuracy.md

tree-height-monotonically-increases [IN] OBSERVATION

Since neither the reference implementation nor PostgreSQL merges internal nodes on delete, tree height only grows (on root splits) and never shrinks, even under heavy deletion

Source: entries/2026/05/28/topic-postgres-nbtree-lazy-deletion.md

truncate-advances-base-offset-additively [IN] OBSERVATION

Topic.truncate updates baseoffsets[partition] with += actual, a relative shift, because it only removes a contiguous prefix from the front of the partition log.

Source: entries/2026/05/29/topic-log-compaction-vs-retention.md

truncate-plus-per-record-crc-is-dangerous-combination [IN] OBSERVATION

The non-atomic truncate() in wal.py can produce reordered or incomplete files that pass per-record CRC validation; chained checksums would detect the corruption at the chain break point

Source: entries/2026/05/29/topic-cumulative-checksums.md

tumbling-window-aligned-to-zero [IN] OBSERVATION

TumblingWindowAggregator aligns window boundaries to multiples of the window size starting from 0 — windows are [0, size), [size, 2*size), etc., not relative to the first event's timestamp.

Source: entries/2026/05/29/stream-join-processor-teststreamjoin_processor.md

tumbling-window-floor-division [IN] OBSERVATION

TumblingWindowAggregator assigns windows by floor-dividing the timestamp by window size, producing aligned non-overlapping boundaries regardless of when events arrive

Source: entries/2026/05/29/stream-join-processor-streamjoinprocessor.md

two-sstable-implementations-same-pattern [IN] OBSERVATION

lsm.py (using sparseindexinterval=16 and bisect) and sstable.py (using block_size=64 and manual binary search) implement the same sparse-index-with-block-scan pattern with different defaults, naming, and file format maturity

Source: entries/2026/05/29/topic-sstable-block-format.md

two-wal-designs-in-repo [IN] OBSERVATION

The standalone wal.py uses a logical WAL (keyed PUT/DELETE operations with COMMIT markers) while btree.py uses a physical WAL (raw page images); they solve different problems and have different recovery semantics

Source: entries/2026/05/29/topic-write-ahead-logging-fsync-ordering.md

unbundled-catchup-rebuild-equivalence [IN] OBSERVATION

Catch-up via snapshotandstream and rebuild via full CDC event replay must produce identical derived-system state, verified by comparing get_state() output from two independent derived systems

Source: entries/2026/05/29/unbundled-database-testtestervalidation.md

unbundled-db-catch-up-replays-history [IN] OBSERVATION

A derived system added with catch_up=True receives all historical CDC events to reach current state, without requiring a separate snapshot mechanism.

Source: entries/2026/05/29/unbundled-database-testunbundleddatabase.md

unbundled-db-cdc-events-carry-old-value [IN] OBSERVATION

Every update and delete CDC event includes the previous value (oldvalue), enabling derived systems to undo prior state; inserts have oldvalue=None.

Source: entries/2026/05/29/unbundled-database-testertestunbundled_database.md

unbundled-db-composes-via-log [IN] OBSERVATION

The unbundled database wires independent subsystems (WAL, storage engine, CDC, derived systems) through a shared append-only log with independent consumer positions, applying Unix-style composition at the system architecture level

Source: entries/2026/05/29/topic-ddia-chapter-10-batch-processing.md

unbundled-db-flush-required-for-derived [IN] OBSERVATION

Derived systems (secondary indexes, materialized views, full-text search) only see mutations after db.flush() is called; writes go to WAL and storage immediately but CDC consumers are decoupled.

Source: entries/2026/05/29/unbundled-database-testunbundleddatabase.md

unbundled-db-is-log-first-cdc-is-state-first [IN] OBSERVATION

The unbundled database writes to WAL before storage engine, while the CDC module writes to in-memory rows before appending the log — opposite ordering of the source-of-truth relationship

Source: entries/2026/05/29/topic-cdc-flush-semantics.md

unbundled-db-put-returns-cdc-event [IN] OBSERVATION

UnbundledDatabase.put() and delete() return CDCEvent objects directly, making CDC a synchronous, first-class part of the write API rather than a side-channel.

Source: entries/2026/05/29/unbundled-database-testunbundleddatabase.md

unbundled-db-rebuild-equals-live [IN] OBSERVATION

rebuildsystem() must produce state identical to incremental live processing — this is a tested invariant verified by capturing getstate() before and after rebuild.

Source: entries/2026/05/29/unbundled-database-testunbundleddatabase.md

unbundled-db-wal-lsn-starts-at-one [IN] OBSERVATION

The unbundled database WAL assigns LSNs starting from 1 (not 0), with latest_lsn == 0 indicating an empty log.

Source: entries/2026/05/29/unbundled-database-testertestunbundled_database.md

unbundled-db-wal-persistence-jsonl [IN] OBSERVATION

The unbundled database's WAL supports optional file-backed persistence via a .jsonl file path, and a new WriteAheadLog instance recovers entries from that file on init.

Source: entries/2026/05/29/unbundled-database-testunbundleddatabase.md

unbundled-flush-zeroes-lag [IN] OBSERVATION

After db.flush(), get_lag() returns 0 for all derived systems, confirming all pending CDC events have been consumed

Source: entries/2026/05/29/unbundled-database-testtestervalidation.md

unbundled-lsn-sequential [IN] OBSERVATION

LSNs returned by the unbundled database's put()/delete() are 1-indexed and strictly sequential with no gaps

Source: entries/2026/05/29/unbundled-database-testtestervalidation.md

unbundled-rebuild-clears-state [IN] OBSERVATION

StorageEngine.rebuild(wal) clears all existing data (including manually injected entries) before replaying WAL entries, ensuring no phantom or stale state survives a rebuild

Source: entries/2026/05/29/unbundled-database-testtestervalidation.md

unbundled-wal-entries-ordered-by-lsn [IN] OBSERVATION

WAL entries in the unbundled database's _entries list are always in ascending LSN order, maintained by the sequential nature of append()

Source: entries/2026/05/29/unbundled-database-unbundleddatabase-truncatebefore.md

unbundled-wal-truncate-keeps-gte [IN] OBSERVATION

The unbundled database WAL cutoff is exclusive-below: entries with lsn >= cutoff are retained, entries with lsn < cutoff are discarded

Source: entries/2026/05/29/unbundled-database-unbundleddatabase-truncatebefore.md

unbundled-wal-truncate-memory-only [IN] OBSERVATION

WriteAheadLog.truncatebefore in the unbundled database removes entries from memory but does not modify the on-disk WAL file, so reloading from persistpath restores truncated entries

Source: entries/2026/05/29/unbundled-database-unbundleddatabase-truncatebefore.md

unbundled-wal-truncate-preserves-lsn [IN] OBSERVATION

Truncation in the unbundled database WAL never resets nextlsn, so appending after truncation continues with monotonically increasing LSNs

Source: entries/2026/05/29/unbundled-database-unbundleddatabase-truncatebefore.md

unfenced-server-accepts-all [IN] OBSERVATION

UnfencedResourceServer has no token parameter on write() and accepts all writes unconditionally, serving as the pedagogical unsafe baseline for comparison with FencedResourceServer

Source: entries/2026/05/29/fencing-tokens-fencing_tokens.md

unhandled-event-types-silently-skipped [IN] OBSERVATION

Events whose eventtype has no entry in the handlers dict are skipped without warning in both reconstructstate and Projection.catch_up

Source: entries/2026/05/29/event-sourcing-store-eventstore-reconstructstate.md

utf8-decode-unguarded [IN] OBSERVATION

If a segment contains a key with invalid UTF-8 bytes, scansegment raises an unhandled UnicodeDecodeError that aborts recovery for all subsequent segments.

Source: entries/2026/05/28/log-structured-hash-table-bitcask-scansegment.md

values-must-be-bytes [IN] OBSERVATION

B-tree values are written as raw bytes with no encoding; keys accept str (auto-encoded to UTF-8) or bytes, but values must already be bytes

Source: entries/2026/05/28/b-tree-storage-engine-btree-serializeleaf.md

vc-prune-is-lossy [IN] OBSERVATION

Pruning permanently discards causal information; subsequent compare() or dominates() calls treat pruned nodes as counter=0, which can produce false concurrency or false dominance

Source: entries/2026/05/29/vector-clocks-vector_clock-prune.md

vc-prune-keeps-highest-counters [IN] OBSERVATION

VectorClock.prune(n) retains the n entries with the highest counter values and discards the rest, using sorted descending by counter

Source: entries/2026/05/29/vector-clocks-vector_clock-prune.md

vc-prune-tiebreak-unstable [IN] OBSERVATION

When multiple nodes share the same counter value during prune, which survive depends on Python's sorted stability over dict iteration order — not a deterministic tiebreak policy callers can rely on

Source: entries/2026/05/29/vector-clocks-vector_clock-prune.md

vector-clock-compare-partial-order [IN] OBSERVATION

VectorClock.compare returns one of four string values — BEFORE, AFTER, EQUAL, or CONCURRENT — implementing a partial order where the symmetric property holds (BEFORE ↔ AFTER)

Source: entries/2026/05/29/vector-clocks-testvectorclock.md

vector-clock-immutability [IN] OBSERVATION

VectorClock is immutable: increment, merge, and prune all return new instances and never modify _clock in place.

Source: entries/2026/05/29/vector-clocks-vector_clock.md

verification-impossible-at-every-layer [IN] OBSERVATION

Verification of system correctness is impossible at every layer of the architecture: data integrity verification degrades from partial to absent along the storage pipeline (WAL CRCs exclude metadata, SSTables have no checksums at all), while protocol safety claims are unfalsifiable because all distributed testing uses synchronous simulation that cannot exercise the asynchronous failure modes the protocols are designed to tolerate.

verify-proof-direction-unvalidated [IN] OBSERVATION

The direction field in proof siblings is not validated; any value other than "left" is silently treated as "right" via the else branch.

Source: entries/2026/05/29/merkle-tree-merkletree-verifyproof.md

verify-proof-hash-consistency [IN] OBSERVATION

verifyproof concatenates hex-encoded hash strings and encodes to bytes before hashing, matching the exact same scheme used in init_ to build internal nodes — a mismatch would silently break all proof verification.

Source: entries/2026/05/29/merkle-tree-merkletree-verifyproof.md

verify-proof-is-pure-static [IN] OBSERVATION

verify_proof is a @staticmethod with no side effects; it requires no tree instance, only the data and a MerkleProof, so it can run on a different machine than the one that built the tree.

Source: entries/2026/05/29/merkle-tree-merkletree-verifyproof.md

verify-proof-root-trust [IN] OBSERVATION

verify_proof confirms data matches a given root hash but does not authenticate the root itself; callers must obtain a trusted root through an independent channel (e.g., a signed block header).

Source: entries/2026/05/29/merkle-tree-merkletree-verifyproof.md

version-scan-includes-unavailable [IN] OBSERVATION

put() determines the next version by scanning all replicas including those marked unavailable, which prevents version regression but couples version assignment to unavailable node state.

Source: entries/2026/05/29/read-repair-read_repair.md

versioned-store-sibling-semantics [IN] OBSERVATION

VersionedKVStore.get returns a list of VersionedValue entries; concurrent (causally unrelated) writes produce multiple siblings, and the store never auto-resolves conflicts — clients must call reconcile

Source: entries/2026/05/29/vector-clocks-testvectorclock.md

versioned-value-no-tombstone-flag [IN] OBSERVATION

The VersionedValue dataclass (dynamo.py:14-18) carries only value, version, and node_id with no field to distinguish a live value from a deletion marker, making correct distributed deletes impossible without schema changes

Source: entries/2026/05/29/topic-leaderless-deletion-gap.md

view-change-contagion [IN] OBSERVATION

Receiving a VIEWCHANGE from another node causes a non-primary node to broadcast its own VIEWCHANGE if it hasn't already, propagating the vote through the cluster without requiring all nodes to independently detect the faulty primary.

Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-handleview_change.md

view-change-requires-2f-plus-1 [IN] OBSERVATION

The new primary only acts on a view change after collecting at least 2f+1 VIEW_CHANGE messages, matching the standard PBFT quorum requirement for liveness.

Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-handleview_change.md

view-never-decreases [IN] OBSERVATION

handleviewchange silently drops any message with msg.view <= self.currentview, enforcing monotonically non-decreasing view progression across all nodes.

Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-handleview_change.md

visibility-is-pure [IN] OBSERVATION

isvisible is a pure function with no side effects; it reads committed, aborted, tx.activeatstart, and version fields but never mutates any state.

Source: entries/2026/05/29/snapshot-isolation-mvccdatabase-is_visible.md

visibility-requires-three-conditions [IN] OBSERVATION

A version is visible to a transaction only if its creator committed, was not in the reader's activeatstart set, and has a lower tx_id — all three must hold

Source: entries/2026/05/29/topic-mvcc-snapshot-isolation.md

vnode-150-guarantees-sub-1.5-imbalance [IN] OBSERVATION

With 3 equally-weighted nodes and 150 vnodes each, loadimbalance() is asserted to stay below 1.5 (testconsistent_hashing.py:49).

Source: entries/2026/05/29/topic-virtual-node-count-tuning.md

wal-all-mutations-under-lock [IN] OBSERVATION

Every WAL method that writes records or modifies files (append, appendbatch, checkpoint, truncate) acquires self.lock before performing any I/O

Source: entries/2026/05/29/topic-wal-size-check-toctou.md

wal-append-mode-no-overwrite [IN] OBSERVATION

openlatest opens WAL files in "ab" (append-binary) mode, making it impossible for post-crash writes to overwrite pre-crash records; this is the bridge between recovery and forward progress

Source: entries/2026/05/29/topic-wal-crash-recovery-semantics.md

wal-append-no-validation [IN] OBSERVATION

append performs no validation of optype against OPBYTES; invalid operation names raise KeyError from the dictionary lookup, not a descriptive error.

Source: entries/2026/05/29/write-ahead-log-wal-append.md

wal-append-not-transactional [IN] OBSERVATION

Individual append calls are not wrapped in any transaction boundary; for atomic multi-operation writes, append_batch must be used instead.

Source: entries/2026/05/29/write-ahead-log-wal-append.md

wal-batch-adds-commit-record [IN] OBSERVATION

appendbatch of N items writes N+1 records to the WAL: the N data records plus a trailing record with optype == "COMMIT" that seals the batch

Source: entries/2026/05/28/write-ahead-log-test_wal.md

wal-batch-always-fsyncs [IN] OBSERVATION

append_batch() always force-fsyncs regardless of the configured sync mode, because batch atomicity requires the COMMIT marker to be durable before returning.

Source: entries/2026/05/28/write-ahead-log-wal.md

wal-batch-atomicity-via-commit-record [IN] OBSERVATION

A batch write is only considered complete if its trailing COMMIT record is present and passes CRC; incomplete batches (missing commit) are discarded during replay

Source: entries/2026/05/28/topic-lsm-crash-recovery-wal-format.md

wal-batch-bug-is-performance-not-correctness [IN] OBSERVATION

The stale writecount after forced syncs causes extra fsyncs (premature batch threshold triggers) but never causes data loss — forced syncs always flush to disk regardless of counter state.

Source: entries/2026/05/29/topic-write-count-reset-semantics.md

wal-batch-commit-sentinel [IN] OBSERVATION

Every append_batch call writes exactly one COMMIT record (op code 3) as the final record in the batch, with empty key and value fields, consuming one additional sequence number.

Source: entries/2026/05/28/write-ahead-log-wal-append_batch.md

wal-batch-mode-loses-up-to-n-records [IN] OBSERVATION

In WAL batch sync mode, up to batchsynccount - 1 records (default 99) can be lost on kill -9 because os.fsync() is deferred until the batch threshold is reached; only append_batch() force-fsyncs regardless of mode.

Source: entries/2026/05/29/topic-crash-recovery-invariants.md

wal-batch-relies-on-practical-atomicity [IN] OBSERVATION

append_batch() buffers all operations plus COMMIT into a single write() call to minimize the partial-write window, but the code acknowledges this is not a true atomicity guarantee — a crash during the write can persist a prefix without the COMMIT marker

Source: entries/2026/05/28/topic-wal-atomicity-guarantees.md

wal-batch-single-file [IN] OBSERVATION

appendbatch writes the entire batch buffer in one fd.write call before mayberotate, so a batch never spans two WAL files under normal operation

Source: entries/2026/05/29/topic-wal-recovery-semantics.md

wal-batch-single-write [IN] OBSERVATION

append_batch() serializes all data records plus the trailing COMMIT marker into one bytearray and issues a single fd.write() call, relying on OS write atomicity for small batches

Source: entries/2026/05/28/topic-wal-commit-semantics.md

wal-batch-sync-count-controls-durability-window [IN] OBSERVATION

In batch mode, up to batchsynccount - 1 records may be lost on crash because they haven't been fsynced yet.

Source: entries/2026/05/29/write-ahead-log-wal-dosync.md

wal-batch-sync-force-override [IN] OBSERVATION

The WAL's dosync batch mode (fsync every N writes) is overridden by force=True in append_batch and checkpoint, ensuring atomic batch boundaries and checkpoint records are always fsynced regardless of configured sync mode.

Source: entries/2026/05/29/topic-bitcask-durability-tradeoffs.md

wal-binary-format-prevents-evolution-and-recovery [IN] OBSERVATION

The WAL binary format is simultaneously inflexible and fragile: contiguous record packing with no block alignment prevents resync after mid-file corruption, signed 32-bit length fields theoretically admit negative values with no guard, and there is no version field — a format change invalidates all existing WAL files with no migration path and no way to distinguish old-format from new-format records.

wal-checkpoint-consumes-sequence [IN] OBSERVATION

checkpoint() increments the WAL sequence counter by 1, occupying a position in the sequence number space alongside data records (e.g., after 6 data records at seq 1-6, currentseqnum() is 7 and checkpoint() returns 8).

Source: entries/2026/05/28/write-ahead-log-testertestwal.md

wal-checkpoint-forces-fsync [IN] OBSERVATION

checkpoint() calls dosync(force=True), making the checkpoint marker durable on disk regardless of the configured sync mode, so recovery boundaries are never lost even in "none" or "batch" mode.

Source: entries/2026/05/29/topic-fsync-durability-tradeoffs.md

wal-checkpoint-forces-sync [IN] OBSERVATION

checkpoint() calls dosync(force=True), making checkpoint markers always durable on disk regardless of the configured sync mode

Source: entries/2026/05/29/topic-bitcask-crash-recovery-semantics.md

wal-checkpoint-returns-seq [IN] OBSERVATION

WriteAheadLog.checkpoint() returns the sequence number it wrote, providing the caller the exact truncation boundary for the WAL-to-data-store coordination protocol

Source: entries/2026/05/29/topic-log-structured-checkpoint-coordination.md

wal-checkpoint-seq-is-next-after-data [IN] OBSERVATION

Checkpoint markers consume a sequence number in the same monotonic space as data records — a checkpoint after 7 data records occupies seq=8

Source: entries/2026/05/28/write-ahead-log-test_wal.md

wal-commit-clears-all-entries [IN] OBSERVATION

WAL.commit clears the entire WAL unconditionally via truncate; there is no partial commit or transaction grouping within a single WAL file

Source: entries/2026/05/28/b-tree-storage-engine-btree-commit.md

wal-commit-sync-before-truncate [IN] OBSERVATION

WAL.commit always fsyncs the data file before truncating the WAL file; reversing this order would create a crash-safety hole where committed data could be lost

Source: entries/2026/05/28/b-tree-storage-engine-btree-commit.md

wal-commit-syncs-metadata-implicitly [IN] OBSERVATION

WAL.commit() in the B-tree engine syncs metadata page 0 because PageManager.sync() fsyncs the single shared file descriptor that holds both data pages and metadata; there is no dedicated metadata fsync step

Source: entries/2026/05/29/topic-wal-commit-vs-metadata-sync-ordering.md

wal-concurrent-write-during-read-can-trigger-corruption-stop [IN] OBSERVATION

A concurrent append() writing to a file the iterator is reading can produce a partial record that readrecord interprets as CRC failure, terminating the entire iterator and dropping all remaining valid records across subsequent files

Source: entries/2026/05/29/topic-wal-concurrency-safety.md

wal-contiguous-no-block-alignment [IN] OBSERVATION

None of the WAL implementations use block-aligned or page-aligned record layouts; all records are packed contiguously with variable-length encoding, so torn writes can land at arbitrary byte offsets

Source: entries/2026/05/29/topic-block-aligned-wal-records.md

wal-corruption-returns-valid-prefix [IN] OBSERVATION

When the WAL encounters a corrupted record during replay, it stops and returns all structurally valid records preceding the corruption point rather than raising an exception to the caller.

Source: entries/2026/05/28/write-ahead-log-testertestwal.md

wal-corruption-stops-per-file [IN] OBSERVATION

recoverseq_num stops at the first corrupted record within each file but continues to subsequent WAL files, so cross-file recovery is resilient but intra-file records after corruption are silently lost

Source: entries/2026/05/29/topic-wal-recovery-semantics.md

wal-crc-does-not-cover-seqnum [IN] OBSERVATION

The CRC32 checksum covers only op_type + key + value; a corrupted sequence number would not be detected by integrity checking.

Source: entries/2026/05/28/write-ahead-log-wal.md

wal-crc-includes-op-type [IN] OBSERVATION

The standalone WAL uniquely includes optypebyte in its CRC input alongside key and value, protecting against silent operation-type corruption (e.g., PUT flipped to DELETE) that the other CRC implementations leave undetected

Source: entries/2026/05/29/topic-crc-scope-comparison-across-implementations.md

wal-crc-mismatch-halts-all-replay [IN] OBSERVATION

readall_records catches CRC ValueError and stops iteration entirely (returns, not continues), so corruption in any file aborts reading of all subsequent files too.

Source: entries/2026/05/28/write-ahead-log-wal.md

wal-default-segment-10mb [IN] OBSERVATION

WAL and hash-index Bitcask both default maxfilesize to 10 MB, while the log-structured Bitcask defaults to 1 MB — reflecting the latter's heavier reliance on compaction with hint files to offset recovery cost.

Source: entries/2026/05/29/topic-wal-segment-sizing-tradeoffs.md

wal-default-sync-mode-is-sync [IN] OBSERVATION

WriteAheadLog._init defaults syncmode to "sync", making per-write fsync the safe default that callers must explicitly opt out of for higher throughput

Source: entries/2026/05/29/topic-sync-mode-none-safety.md

wal-do-sync-branches-mutually-exclusive [IN] OBSERVATION

The dosync method's if/elif structure means forced syncs in batch mode skip all counter logic entirely — the counter is neither incremented nor reset, which is the structural root cause of the premature batch-sync bug.

Source: entries/2026/05/29/topic-write-count-reset-semantics.md

wal-do-sync-requires-caller-lock [IN] OBSERVATION

dosync does not acquire self._lock; thread safety depends on every caller holding the lock before invoking it.

Source: entries/2026/05/29/write-ahead-log-wal-dosync.md

wal-docstring-describes-intent-not-behavior [IN] OBSERVATION

The replay() docstring claims it "skips uncommitted batches" (describing the DDIA concept), but the inline comments and implementation show it returns all PUT/DELETE records regardless of COMMIT presence; the docstring is aspirational, the comments are accurate

Source: entries/2026/05/29/topic-wal-batch-atomicity-gap.md

wal-durability-tiers-lost-during-recovery [IN] OBSERVATION

The WAL's carefully calibrated write-time durability tiers (per-write fsync in sync mode vs. batch-only fsync) are completely invisible to recovery: sequence numbers and checkpoints that could distinguish between definitely-durable and potentially-lost records are never consulted, so replay treats all CRC-valid records identically regardless of whether they were fsynced — the write-time investment in durability is wasted.

wal-encode-is-pure [IN] OBSERVATION

encoderecord is a pure function that performs no I/O or state mutation; all disk writes (and fsync) are the responsibility of callers (append, append_batch, checkpoint, truncate).

Source: entries/2026/05/28/write-ahead-log-wal-encoderecord.md

wal-eof-advances-to-next-file [IN] OBSERVATION

An EOF or partial read within a single WAL file causes readall_records to advance to the next file rather than terminate the entire iterator

Source: entries/2026/05/29/write-ahead-log-wal-readall_records.md

wal-exactly-at-limit-rotates [IN] OBSERVATION

A WAL file whose size equals maxfilesize triggers rotation in open_latest; only files strictly smaller are reused.

Source: entries/2026/05/29/write-ahead-log-wal-openlatest.md

wal-file-size-managed-without-syscall-overhead [IN] OBSERVATION

WAL file size management avoids filesystem syscall overhead through tell()-based tracking with soft-limit semantics: the hot path uses fd.tell() instead of os.path.getsize() to avoid stat calls and TOCTOU races, the size limit is a soft cap checked before the next write rather than mid-write, and files at exactly the limit trigger rotation to prevent unbounded growth while allowing single-record overshoot.

wal-files-sorted-lexicographic [IN] OBSERVATION

walfiles() returns segment files sorted by filename ({N:06d}.wal), which equals chronological order since segment numbers increment monotonically

Source: entries/2026/05/28/topic-wal-segment-deletion-ordering.md

wal-flush-truncate-replay-safety [IN] OBSERVATION

If a crash occurs between SSTable flush and WAL truncation, replay re-inserts already-persisted entries into the memtable, which is safe because memtable values shadow SSTable values during reads (last-writer-wins)

Source: entries/2026/05/29/log-structured-merge-tree-lsm-replay.md

wal-format-change-breaks-compatibility [IN] OBSERVATION

Changing the CRC input in encoderecord invalidates all existing WAL files; a production deployment would require a version byte and dual-path CRC verification during migration

Source: entries/2026/05/29/topic-seq-num-integrity-gap.md

wal-fsync-per-entry [IN] OBSERVATION

Each log_write call forces an os.fsync, guaranteeing per-entry durability at the cost of one sync syscall per logged page write.

Source: entries/2026/05/29/b-tree-storage-engine-btree-log_write.md

wal-has-no-before-images [IN] OBSERVATION

WALRecord stores only key and value (the new value) with no old_value field, making undo structurally impossible from the log alone

Source: entries/2026/05/29/topic-undo-logging-and-steal-policy.md

wal-has-threading-lock [IN] OBSERVATION

WriteAheadLog in wal.py:80 uses a threading.Lock to serialize append, append_batch, checkpoint, and truncate, making it the only concurrency-aware component in the codebase

Source: entries/2026/05/29/topic-btree-concurrency-control.md

wal-has-truncation-raft-does-not [IN] OBSERVATION

The WAL module provides explicit truncate(uptoseq) and file rotation for log lifecycle management; the Raft module has no equivalent despite both being append-only log abstractions

Source: entries/2026/05/29/topic-raft-log-compaction.md

wal-hot-path-avoids-getsize [IN] OBSERVATION

mayberotate uses self._fd.tell() instead of os.path.getsize(), avoiding filesystem stat calls and TOCTOU races on every append

Source: entries/2026/05/29/topic-wal-size-check-toctou.md

wal-is-redo-only [IN] OBSERVATION

The B-tree WAL uses redo-only recovery (replay logged page images forward); there is no undo log, so incomplete pre-commit writes in the data file are overwritten by WAL replay to restore consistency

Source: entries/2026/05/28/b-tree-storage-engine-btree-commit.md

wal-is-source-of-truth [IN] OBSERVATION

The unbundled database's StorageEngine is fully derivable from the WriteAheadLog; calling rebuild() replays the entire WAL and reproduces identical state, making the log the authoritative record

Source: entries/2026/05/29/topic-ddia-ch12-unbundling.md

wal-iterate-preserves-all-ops [IN] OBSERVATION

iterate() yields every record including COMMIT and CHECKPOINT markers, providing the raw stream needed for commit-aware recovery logic built on top of replay()

Source: entries/2026/05/28/topic-wal-commit-semantics.md

wal-key-value-length-signed-int32 [IN] OBSERVATION

Key and value lengths in the WAL binary format are packed as signed 32-bit integers (struct format <i), limiting each field to ~2 GB and leaving negative lengths unguarded.

Source: entries/2026/05/28/write-ahead-log-wal-encoderecord.md

wal-little-endian-hardcoded [IN] OBSERVATION

The WAL binary record format uses little-endian byte order unconditionally (struct prefix <), making log files non-portable across architectures with different endianness.

Source: entries/2026/05/28/write-ahead-log-wal-encoderecord.md

wal-max-file-size-is-soft [IN] OBSERVATION

maxfilesize is a soft limit; a single batch can push a WAL file arbitrarily past it because mayberotate runs only after the write and sync complete

Source: entries/2026/05/29/topic-batch-atomicity-across-rotation.md

wal-max-file-size-is-soft-limit [IN] OBSERVATION

maxfile_size is a soft cap: the check triggers rotation for the *next* write, it does not prevent the current write from exceeding the limit.

Source: entries/2026/05/29/write-ahead-log-wal-mayberotate.md

wal-module-uses-locking-lsm-does-not [IN] OBSERVATION

The WAL module in write-ahead-log/wal.py uses self._lock for thread safety across its mutation paths, while the LSM tree module has zero locking or concurrency control despite having the same concurrent-access risks

Source: entries/2026/05/29/topic-superversion-refcount-implementation.md

wal-no-begin-marker [IN] OBSERVATION

The WAL protocol has no BEGIN record type, making it impossible during replay to distinguish a standalone PUT from the first record of a multi-record batch — the structural root cause of why COMMIT markers cannot enforce atomicity.

Source: entries/2026/05/29/topic-incomplete-batch-recovery.md

wal-no-commit-method [IN] OBSERVATION

The standalone WriteAheadLog class in write-ahead-log/wal.py has no commit method; sync-then-truncate coordination must be implemented by a higher-level storage engine that composes the WAL with a data file

Source: entries/2026/05/29/topic-wal-commit-protocol.md

wal-no-concurrent-group-commit [IN] OBSERVATION

The WAL uses a single threading.Lock and a write counter for batch sync rather than a concurrent waiter queue, so it cannot amortize fsync cost across concurrent callers the way PostgreSQL's group commit does.

Source: entries/2026/05/29/topic-group-commit-optimization.md

wal-no-directory-fsync [IN] OBSERVATION

The WAL implementation never fsyncs the parent directory after segment creation or deletion, meaning file metadata changes (including unlinks) are not guaranteed durable on crash on Linux

Source: entries/2026/05/28/topic-wal-segment-deletion-ordering.md

wal-no-gap-detection [IN] OBSERVATION

Neither the WAL reader nor any visible consumer checks for sequence number gaps after replay, making silent data loss from mid-file corruption undetectable

Source: entries/2026/05/29/topic-wal-recovery-semantics.md

wal-no-nested-batches [IN] OBSERVATION

The WAL format has no batch-start marker and no nesting support; the lock held during append_batch prevents interleaving, making nested or concurrent batches structurally impossible

Source: entries/2026/05/29/topic-commit-aware-replay-design.md

wal-no-resync-after-corruption [IN] OBSERVATION

A corrupted recordlength in the WAL causes read_record to misframe all subsequent records; there is no magic-byte or scan-forward recovery mechanism to re-synchronize

Source: entries/2026/05/29/topic-wal-record-format-evolution.md

wal-no-rollback-on-io-failure [IN] OBSERVATION

If write() or fsync() fails mid-batch in append_batch, sequence numbers are already incremented with no rollback mechanism, creating a permanent gap in the sequence space.

Source: entries/2026/05/28/write-ahead-log-wal-append_batch.md

wal-no-torn-write-tests [IN] OBSERVATION

The WAL module has no tests for truncated records, CRC mismatches, or partial writes; the B-tree module tests CRC corruption explicitly but the WAL does not

Source: entries/2026/05/29/topic-partial-write-detection.md

wal-not-imported-outside-tests [IN] OBSERVATION

No production module imports the standalone WriteAheadLog; it is only referenced in testwal.py and testertest_wal.py, meaning crash-safe commit coordination using this module is unimplemented

Source: entries/2026/05/29/topic-wal-commit-protocol.md

wal-open-latest-init-only [IN] OBSERVATION

openlatest is called exclusively from _init, so the TOCTOU window between wal_files() and os.path.getsize() cannot be hit by concurrent WAL operations

Source: entries/2026/05/29/topic-wal-size-check-toctou.md

wal-partial-read-is-eof [IN] OBSERVATION

readrecord returns None on partial/short reads (torn writes), treated as EOF; this is distinct from CRC mismatch which raises ValueError — the distinction separates "crash during write" from "data corruption."

Source: entries/2026/05/28/write-ahead-log-wal.md

wal-partial-truncate-idempotent-replay [IN] OBSERVATION

If the standalone WAL's truncate() crashes mid-iteration over files, un-processed files retain old records that will be replayed on recovery; this is safe because PUT and DELETE are idempotent against an already-current store

Source: entries/2026/05/29/topic-wal-crash-safety-gap.md

wal-protects-data-not-metadata [IN] OBSERVATION

The WAL protects user key-value writes but does not record SSTable-level state transitions (flush, compaction, level assignment), leaving metadata changes unprotected across crashes

Source: entries/2026/05/28/topic-manifest-based-compaction.md

wal-provides-crash-safety-not-concurrency [IN] OBSERVATION

The WAL syncs with os.fsync for durability guarantees only; it has no role in coordinating concurrent access, consistent with the single-threaded design

Source: entries/2026/05/28/topic-concurrent-btree-deletion.md

wal-read-record-stateless [IN] OBSERVATION

readrecord is a module-level function with no dependency on WriteAheadLog instance state, enabling its use during WAL construction/recovery before the object is fully initialized.

Source: entries/2026/05/28/write-ahead-log-wal-readrecord.md

wal-record-format-length-prefixed [IN] OBSERVATION

Each WAL record on disk is prefixed with a 4-byte little-endian uint32 length covering everything after itself, allowing the reader to atomically skip or validate entire records.

Source: entries/2026/05/28/write-ahead-log-wal-readrecord.md

wal-record-length-excludes-own-prefix [IN] OBSERVATION

The recordlength field counts 21 + len(key) + len(value) bytes — everything after the 4-byte length prefix itself — so total on-disk size per record is recordlength + 4.

Source: entries/2026/05/28/write-ahead-log-wal-encoderecord.md

wal-record-length-prefixed [IN] OBSERVATION

Every WAL record is prefixed with a 4-byte little-endian length covering all subsequent fields (CRC through value), enabling the reader to know exactly how many bytes to consume

Source: entries/2026/05/28/topic-lsm-crash-recovery-wal-format.md

wal-recover-on-every-startup [IN] OBSERVATION

BTree._init_ calls WAL.recover() unconditionally on every startup; an empty WAL short-circuits after a single read with no further I/O

Source: entries/2026/05/29/b-tree-storage-engine-btree-recover.md

wal-recover-seq-vs-read-all-differ [IN] OBSERVATION

recoverseqnum uses break on corruption (continues to next file) while readallrecords uses return (stops everything) — recovery is lenient to maximize the sequence counter, replay is strict to avoid replaying past corruption

Source: entries/2026/05/29/write-ahead-log-wal-readall_records.md

wal-recover-then-truncate [IN] OBSERVATION

WAL recovery fsyncs the data file via page_manager.sync() before truncating the WAL, so a crash during recovery itself leaves the WAL intact for re-replay

Source: entries/2026/05/29/b-tree-storage-engine-btree-recover.md

wal-recovery-contract-is-valid-prefix [IN] OBSERVATION

The WAL's recovery model guarantees a valid-prefix contract: both corruption (CRC mismatch halts replay returning prior valid records) and partial writes (treated as EOF, not error) cause replay to stop cleanly, always returning a consistent prefix of the written log.

wal-recovery-depends-on-listdir [IN] OBSERVATION

WAL segment discovery during recovery uses os.listdir(), so any segment whose parent directory entry was not fsynced becomes invisible after a crash — connecting the dir-fsync gap to concrete data loss.

Source: entries/2026/05/29/topic-fsync-on-new-file-creation.md

wal-recovery-infrastructure-is-vestigial [IN] OBSERVATION

The WAL carries vestigial recovery-related infrastructure: sequence numbers are monotonically assigned under lock and stored in every record but never consulted for ordering, deduplication, or gap detection during recovery (file position alone determines replay order), and checkpoint records occupy positions in the same sequence space but are neither searched for during recovery scanning nor used as truncation watermarks.

wal-recovery-scans-full-file [IN] OBSERVATION

recoverseq_num in write-ahead-log/wal.py reads every record in every WAL file sequentially from byte zero; there is no seek-to-end or block-skip optimization, making recovery O(file-size)

Source: entries/2026/05/29/topic-block-aligned-wal-records.md

wal-replay-ignores-commit [IN] OBSERVATION

replay() returns all PUT/DELETE records regardless of whether they are followed by a COMMIT record, meaning uncommitted batches are replayed identically to committed ones.

Source: entries/2026/05/28/write-ahead-log-wal.md

wal-replay-no-atomicity-check [IN] OBSERVATION

replay() does not verify that batch operations have a corresponding COMMIT record; partial batches from a mid-write crash would be replayed as if committed, since replay filters only by record type, not by batch completeness

Source: entries/2026/05/28/topic-wal-commit-semantics.md

wal-rotate-fsync-before-close [IN] OBSERVATION

_rotate always fsyncs the outgoing segment before closing it, ensuring no buffered writes are lost during segment rotation.

Source: entries/2026/05/29/write-ahead-log-wal-_rotate.md

wal-rotate-no-gap-protection [IN] OBSERVATION

If a segment file is manually deleted from the directory, _rotate can produce a filename that reuses a previously-used number, since it derives the next name solely from the current highest filename.

Source: entries/2026/05/29/write-ahead-log-wal-_rotate.md

wal-rotation-is-post-write [IN] OBSERVATION

Both append and appendbatch call maybe_rotate() after write and sync complete; rotation never interrupts an in-progress write, and the file only rotates on the next operation

Source: entries/2026/05/29/topic-batch-atomicity-across-rotation.md

wal-rotation-monotonicity-untested [IN] OBSERVATION

test_rotation asserts record count after rotation but does not verify that sequence numbers are monotonically increasing across file boundaries — the cross-file monotonicity invariant holds by construction but has no regression test.

Source: entries/2026/05/29/topic-multi-file-replay-ordering.md

wal-segment-naming-zero-padded [IN] OBSERVATION

WAL segment filenames are zero-padded 6-digit integers (000001.wal, 000002.wal, ...) derived from the highest existing filename plus one, which makes lexicographic sort equal numeric sort.

Source: entries/2026/05/29/write-ahead-log-wal-_rotate.md

wal-seq-num-global-monotonic [IN] OBSERVATION

seqnum is a single in-memory counter that only increments via += 1; all records across all WAL files share one monotonically increasing sequence space that never resets, even across file rotation.

Source: entries/2026/05/29/topic-multi-file-replay-ordering.md

wal-seq-num-is-monotonic-not-page-lsn [IN] OBSERVATION

WAL seq_num increases monotonically across all records but is not stamped onto data pages, so there is no mechanism to detect whether a replayed operation was already applied — making replay non-idempotent against the underlying store.

Source: entries/2026/05/29/topic-redo-vs-undo-logging.md

wal-seq-num-recovered-on-init [IN] OBSERVATION

On construction, recoverseq_num() scans all WAL files to find the maximum surviving sequence number, guaranteeing the monotonic counter never goes backward across crashes

Source: entries/2026/05/29/topic-log-structured-checkpoint-coordination.md

wal-seq-nums-are-vestigial [IN] OBSERVATION

WAL sequence numbers are monotonic and strictly increasing but entirely vestigial: they are computed under lock, stored in every record, and parsed during recovery, yet never consulted for ordering, deduplication, or gap detection — file position alone determines replay order.

wal-seq-nums-strictly-monotonic [IN] OBSERVATION

Sequence numbers increment under a threading.Lock and are never reused; on recovery, all WAL files are scanned to find the high-water mark so new records continue the sequence.

Source: entries/2026/05/28/write-ahead-log-wal.md

wal-seq-reset-on-commit [IN] OBSERVATION

The WAL sequence counter resets to 0 on every commit; sequence numbers are only meaningful within a single uncommitted transaction window

Source: entries/2026/05/28/b-tree-storage-engine-btree-commit.md

wal-seq-starts-at-one [IN] OBSERVATION

Sequence numbers begin at 1 for the first appended record; an empty WAL reports currentseqnum() == 0

Source: entries/2026/05/28/write-ahead-log-test_wal.md

wal-seq-unused-in-recovery [IN] OBSERVATION

The 4-byte sequence number field in each WAL entry is parsed during recover() but never consulted; replay order is determined solely by file offset

Source: entries/2026/05/29/b-tree-storage-engine-btree-recover.md

wal-single-writer-fd [IN] OBSERVATION

At most one file descriptor is open for WAL writes at any time; _rotate closes the old fd before opening the new one.

Source: entries/2026/05/29/write-ahead-log-wal-_rotate.md

wal-single-writer-thread-level [IN] OBSERVATION

The WAL enforces single-writer via threading.Lock but has no inter-process locking mechanism (no flock/PID file), so the single-writer invariant holds only within a single OS process

Source: entries/2026/05/29/topic-wal-size-check-toctou.md

wal-size-check-uses-disk [IN] OBSERVATION

openlatest checks on-disk file size via os.path.getsize, not an in-memory counter, making it correct across crash/restart boundaries.

Source: entries/2026/05/29/write-ahead-log-wal-openlatest.md

wal-sync-mode-default-is-sync [IN] OBSERVATION

WriteAheadLog defaults to sync_mode="sync", calling flush() + os.fsync() after every single append call — the safest and slowest mode.

Source: entries/2026/05/29/topic-wal-sync-modes.md

wal-sync-mode-survives-crash [IN] OBSERVATION

With sync_mode="sync", WAL records are durable immediately after append returns — they survive process death even if close() is never called.

Source: entries/2026/05/28/write-ahead-log-testertestwal.md

wal-sync-mode-unvalidated [IN] OBSERVATION

WriteAheadLog accepts any string as sync_mode without validation; values other than "sync" and "batch" silently disable all fsync.

Source: entries/2026/05/29/write-ahead-log-wal-dosync.md

wal-thread-safety-stranded-by-callers [IN] OBSERVATION

The WAL carefully serializes all mutations (append, appendbatch, checkpoint, truncate) under a threading.Lock, but both callers that depend on it have no synchronization for their own shared state: the LSM tree mutates memtable, sstables, and immutablememtables without any locking or atomic swaps, and sstables specifically is mutated by both flush (append) and compact (full replacement) concurrently — the lock protects the log but not the data structures built from it.

wal-truncate-blocks-all-operations [IN] OBSERVATION

truncate holds self._lock for its entire duration (close, rewrite, reopen), blocking all concurrent appends, replays, and iterates

Source: entries/2026/05/28/write-ahead-log-wal-truncate.md

wal-truncate-closes-write-fd [IN] OBSERVATION

truncate() flushes, fsyncs, and closes the current write file descriptor before iterating segments, preventing conflicts with files it may need to delete or rewrite

Source: entries/2026/05/28/topic-wal-segment-deletion-ordering.md

wal-truncate-deletes-empty-files [IN] OBSERVATION

WAL files where every record has seqnum <= upto_seq are deleted entirely via os.remove rather than left as empty files on disk

Source: entries/2026/05/28/write-ahead-log-wal-truncate.md

wal-truncate-deletes-oldest-first [IN] OBSERVATION

truncate() iterates segments via walfiles() in oldest-first order, so a crash mid-truncation leaves a contiguous suffix of segments — preserving the recovery invariant that surviving files form a continuous sequence

Source: entries/2026/05/28/topic-wal-segment-deletion-ordering.md

wal-truncate-drops-after-corruption [IN] OBSERVATION

During truncate, if a corrupt record is encountered in a WAL file, all subsequent records in that file are silently discarded regardless of their sequence number — even records with seqnum > upto_seq

Source: entries/2026/05/28/write-ahead-log-wal-truncate.md

wal-truncate-fsyncs-before-scan [IN] OBSERVATION

truncate() flushes and fsyncs the current WAL file before scanning for records to remove, preventing data loss from buffered writes that haven't reached disk

Source: entries/2026/05/29/topic-log-structured-checkpoint-coordination.md

wal-truncate-inclusive-boundary [IN] OBSERVATION

truncate(n) removes records with seq_num <= n (inclusive upper bound), keeping only records strictly greater than n

Source: entries/2026/05/28/write-ahead-log-wal-truncate.md

wal-truncate-is-inclusive [IN] OBSERVATION

truncate(seq) removes all records with sequence number less than or equal to seq, keeping only records where seq_num > seq

Source: entries/2026/05/28/write-ahead-log-test_wal.md

wal-truncate-no-bounds-check [IN] OBSERVATION

truncate does not validate that uptoseq is within range — passing a value beyond currentseqnum() silently deletes all records without error

Source: entries/2026/05/28/write-ahead-log-wal-truncate.md

wal-truncate-not-crash-safe [IN] OBSERVATION

truncate() rewrites WAL files in place without atomic rename, so a crash during truncation can leave the log in an inconsistent state.

Source: entries/2026/05/28/write-ahead-log-wal.md

wal-truncate-preserves-records-above-seq [IN] OBSERVATION

WriteAheadLog.truncate(uptoseq) keeps records with seqnum > upto_seq and deletes only those at or below, enabling partial log reclamation tied to checkpoint boundaries.

Source: entries/2026/05/29/topic-wal-checkpoint-protocol.md

wal-truncate-requires-explicit-seq [IN] OBSERVATION

truncate(uptoseq) takes an explicit sequence number parameter; the WAL never decides what to truncate on its own — the caller must provide the boundary

Source: entries/2026/05/29/topic-log-structured-checkpoint-coordination.md

wal-truncate-requires-prior-checkpoint [IN] OBSERVATION

WAL truncation is only safe because it is called after the data the WAL protects has been durably written to the main store (SSTable flush or data file fsync); violating this ordering loses committed data with no recovery path

Source: entries/2026/05/29/topic-wal-crash-safety-gap.md

wal-truncate-rewrites-files [IN] OBSERVATION

WriteAheadLog.truncate() reads and rewrites every segment file record-by-record rather than deleting whole segment files, making it O(total records) instead of O(segments)

Source: entries/2026/05/29/topic-log-compaction-vs-persistence.md

wal-truncation-is-multi-failure-hazard [IN] OBSERVATION

WAL truncation combines three independent failure modes: it blocks all concurrent operations for its entire duration, silently discards all records after encountering corruption in a file, and is not crash-safe due to in-place file rewriting without atomic rename.

wal-truncation-vs-corruption-distinction [IN] OBSERVATION

readrecord returns None for short reads (truncation) but raises ValueError for CRC mismatch (corruption), giving callers two distinct failure modes to handle differently during recovery

Source: entries/2026/05/29/topic-partial-write-detection.md

wal-two-tier-durability-model [IN] OBSERVATION

The WAL provides two distinct durability levels: individual appends respect the configured sync mode (potentially skipping fsync entirely), while batch operations always force fsync, meaning batch writes are strictly more durable than individual writes regardless of configuration.

wal-uses-crc32-not-sha [IN] OBSERVATION

WAL integrity uses zlib.crc32 (32-bit, non-cryptographic); it detects accidental corruption but not intentional tampering

Source: entries/2026/05/28/topic-wal-checksum-format.md

write-meta-no-fsync [IN] OBSERVATION

PageManager.writemeta calls flush() but not os.fsync(), so metadata updates (nextfreepage, freelisthead) are not durable against power loss — compounding the WAL bypass with a durability gap on the direct write path.

Source: entries/2026/05/29/topic-free-list-corruption-risks.md

write-page-silently-truncates [IN] OBSERVATION

PageManager.writepage truncates data exceeding pagesize to exactly pagesize bytes without raising an error, which would corrupt the node header's numkeys count

Source: entries/2026/05/29/topic-page-overflow-and-size-limits.md

write-skew-has-no-default-tests [IN] OBSERVATION

The write-skew-detection module's tests exist only in testertestssi.py, making it entirely untested under default pytest invocation since that filename does not match the test*.py / *test.py globs

Source: entries/2026/05/29/topic-pytest-default-collection.md

write-time-durability-engineering-abandoned-at-recovery [IN] OBSERVATION

The WAL's carefully engineered write-time durability infrastructure is systematically abandoned at recovery time: the two-tier durability model (per-write sync vs. batch-only fsync) loses all distinction during replay because recovery ignores tiers entirely, AND crash recovery is both broken (no safe path across any implementation) and unverified (no crash or async tests), meaning the engineering investment in write-time safety provides zero value when it is most needed.

wrong-digest-cannot-reach-quorum [IN] OBSERVATION

PREPARE and COMMIT messages with non-matching digests are silently dropped and never count toward quorum thresholds; a Byzantine node sending bad digests cannot contribute to agreement

Source: entries/2026/05/29/topic-byzantine-fault-tolerance.md

wrong-digest-is-node-specific [IN] OBSERVATION

WRONGDIGEST mode produces "baddigest{nodeid}", so two Byzantine nodes with this mode produce different invalid digests rather than accidentally colluding on the same forged value

Source: entries/2026/05/29/byzantine-fault-tolerance-pbft-applybyzantine.md

zero-entries-stripped [IN] OBSERVATION

VectorClock._init_ strips all entries with value 0, ensuring two clocks with identical non-zero entries are always equal and hash-equal.

Source: entries/2026/05/29/vector-clocks-vector_clock.md

zip-truncation-risk [IN] OBSERVATION

If keys and values lists passed to serializeleaf have mismatched lengths, zip silently truncates to the shorter list while num_keys in the header reflects the longer, producing a corrupt page

Source: entries/2026/05/28/b-tree-storage-engine-btree-serializeleaf.md

zlib-crc32-is-iso3309 [IN] OBSERVATION

All 13 CRC call sites across the codebase use zlib.crc32, which implements the ISO 3309 polynomial (0xEDB88320); no module uses the Castagnoli polynomial (CRC-32C) that RocksDB and PostgreSQL prefer for hardware-accelerated checksumming.

Source: entries/2026/05/29/topic-crc32-vs-crc32c.md

zlib-crc32-supports-chaining-natively [IN] OBSERVATION

Python's zlib.crc32(data, initial_value) accepts an initial CRC value, meaning chained checksums could be implemented in this codebase by passing the previous frame's CRC as the seed with no additional hashing infrastructure

Source: entries/2026/05/29/topic-cumulative-checksums.md

2pc-recovery-reaches-terminal-state [OUT] OBSERVATION

Two-phase commit recovery drives all interrupted transactions to a terminal state by replaying committed decisions to participants, with lock ownership guards preventing cross-transaction interference during the recovery process.

avro-schema-evolution-handles-version-mismatch [OUT] OBSERVATION

Avro schema evolution correctly handles writer-reader version mismatches through mandatory dual-schema resolution (writer schema parses wire bytes, reader schema shapes output), with clean separation between structural schema errors at parse time and compatibility errors at resolution time.

bitcask-compaction-preserves-state [OUT] OBSERVATION

Bitcask compaction maintains identical observable behavior: every key returns the same value before and after compaction.

bloom-filter-sizing-is-production-ready [OUT] OBSERVATION

The Bloom filter implementation uses textbook-optimal bit array and hash count formulas, making it ready for production integration.

btree-page-alignment-enables-corruption-recovery [OUT] OBSERVATION

Fixed-size page addressing provides natural resync boundaries for recovering from data file corruption.

btree-range-scan-provides-ordered-access [OUT] OBSERVATION

B-tree range scans provide correct ordered sequential access by walking the actively maintained sibling chain with support for unbounded end keys, traversing all matching keys in sorted order.

btree-splits-are-crash-safe [OUT] OBSERVATION

Multi-page B-tree operations (splits, deletes) are made crash-safe by writing all modifications to the WAL before applying them to data pages.

consistent-hash-routing-correct-under-single-thread [OUT] OBSERVATION

Consistent hash routing provides correct minimal-redistribution key assignment with proper deduplication: adding an Nth node moves approximately 1/N of keys (not a full reshuffle), and the preference list correctly skips virtual nodes of already-seen physical nodes to return exactly replication_factor distinct physical nodes.

crdt-convergence-practically-sustainable [OUT] OBSERVATION

CRDT merge convergence is both algebraically correct and practically sustainable in production with bounded resource consumption.

crdt-merge-algebra-satisfies-convergence-requirements [OUT] OBSERVATION

All four CRDT types satisfy the algebraic properties required for strong eventual convergence: merge is idempotent (re-merging produces no change), equality compares semantic state rather than object identity (enabling correct convergence checks), and ORSet tombstones grow monotonically (preventing element resurrection after removal).

derived-systems-maintain-consistency-when-position-durable [OUT] OBSERVATION

Derived systems (secondary indexes, materialized views, projections) can maintain consistency with their source through position-tracked CDC replay: every derived system tracks how far through the CDC log it has processed, and any derived system can be rebuilt from scratch via full event replay producing identical state to incremental processing.

distributed-layer-has-three-incompatible-convergence-models [OUT] OBSERVATION

The distributed layer uses three fundamentally incompatible convergence and resolution models with no unifying bridge: CRDTs encode resolution algebraically in merge semantics, the strategy pattern selects between LWW and custom resolution at runtime, AND consensus and membership use irreconcilable strong-leader (Raft) vs. eventual-consistency (gossip) models — composing correct end-to-end behavior requires manually bridging paradigms designed in isolation.

distributed-layer-incoherent-and-unachievable [OUT] OBSERVATION

The distributed layer is both internally incoherent and externally unachievable: convergence and ordering models are mutually incompatible across modules (CRDTs encode algebraic resolution, LWW uses wall-clock tiebreaking, gossip uses epidemic thresholds) AND correctness is doubly unachievable under network partitions (storage-layer guarantees are unmet, write correctness gaps are amplified by partition-induced failure detection breakdowns).

distributed-models-incompatible-at-convergence-and-ordering [OUT] OBSERVATION

The distributed layer has mutually incompatible models at two independent levels: convergence mechanisms (CRDTs encode algebraic resolution, LWW uses timestamp comparison, Raft requires strong leader authority, gossip relies on epidemic propagation) have no unifying bridge, AND ordering models (Lamport clocks provide total order via node-ID tiebreaking, vector clocks provide partial order with incomparable states, hash indexes use non-monotonic wall-clock time) are fundamentally incompatible across modules.

dynamo-read-repair-ensures-replica-consistency [OUT] OBSERVATION

Eager all-node read repair after every quorum read ensures all replicas converge to the latest version, because repair propagates to all reachable nodes (not just quorum participants) and per-key versioning prevents false cross-key conflicts.

event-sourcing-state-is-fully-reconstructible [OUT] OBSERVATION

Event sourcing state can be fully reconstructed from the event log: projections are stateless replay functions (cache, not source of truth), and snapshots use deep copy to isolate stored state from mutations.

fencing-provides-linearizable-writes [OUT] OBSERVATION

Fencing tokens provide linearizable write protection at the resource server: stale tokens are permanently rejected (no expiration) and the monotonic token ordering ensures only the most recent lock holder can successfully write.

hash-index-survives-restart [OUT] OBSERVATION

Hash-index storage correctly reconstructs its in-memory index from on-disk segments after restart by scanning in ascending order so newer writes overwrite older entries, with fsync-controlled durability for each write.

hinted-handoff-ensures-write-availability [OUT] OBSERVATION

Hinted handoff maintains write availability during replica failures by routing writes to substitute nodes that forward data when the original replica recovers.

lamport-mutex-provides-mutual-exclusion [OUT] OBSERVATION

The Lamport mutex correctly ensures mutual exclusion: the requesting node with the lowest timestamp gets priority, and entry requires acknowledgment from every other node in the system, preventing any two nodes from entering the critical section simultaneously.

lsm-compaction-output-is-correct [OUT] OBSERVATION

LSM compaction produces correct merged output: tombstones are properly purged from the output SSTable, and the k-way merge is forward-only preserving sort order.

lsm-compaction-safely-reduces-read-amplification [OUT] OBSERVATION

LSM compaction correctly reduces read amplification by merging all SSTables into a single output with last-writer-wins dedup: the full-merge strategy eliminates all redundant entries and the newest value always wins during conflict resolution, producing a minimal SSTable set.

lsm-read-path-correct-across-flushes [OUT] OBSERVATION

The LSM read path maintains correctness across memtable flushes by searching newest-first (memtable then SSTables in reverse sequence order) using a reference swap rather than deep copy for the frozen memtable.

lsm-wal-provides-crash-recovery [OUT] OBSERVATION

The LSM WAL replays on construction to recover unflushed memtable state, providing crash recovery for in-flight writes.

multi-leader-convergence-reliable-across-topologies [OUT] OBSERVATION

Multi-leader replication achieves reliable eventual convergence regardless of network topology: sync uses a safe collect-then-distribute pattern with idempotent merge and monotonically advancing timestamps, and topology choice affects only the duration of observable divergence (linear in node count for ring, single round for all-to-all), not the final converged state.

mvcc-isolation-survives-restart [OUT] OBSERVATION

MVCC's three-layer visibility model (append-only versions, own-writes visibility, symmetric deletion) with active sets frozen at transaction begin would provide correct snapshot isolation across process restarts if all state were persisted, since the visibility rules themselves are sound and compose correctly.

pbft-view-change-safety-holds-with-stable-digests [OUT] OBSERVATION

PBFT view changes maintain safety (2f+1 agreement) and liveness (carrying prepared requests forward with contiguous sequence numbers into the new view), ensuring no committed request is lost across leader transitions.

ssi-isolation-holds-under-single-thread [OUT] OBSERVATION

SSI's composed invariant layers provide correct serializable snapshot isolation: MVCC enforces visibility through append-only versions and symmetric deletion rules, and SSI validation ensures own-writes are checked before store consultation, preventing lost updates.

sstable-point-lookup-correct-under-sort-invariant [OUT] OBSERVATION

SSTable point lookups return correct results through two-phase search (binary search on sparse index to identify block, then linear scan within block), with clean None returns for keys not present in the table.

tob-ordering-verified-under-real-failures [OUT] OBSERVATION

Total Order Broadcast maintains identical delivery ordering across node failures: recovered nodes deliver the same slot sequence as live nodes, confirming that per-slot Paxos consensus and contiguous slot delivery enforce a single global order even through failure and recovery.

two-incompatible-conflict-resolution-paradigms [OUT] OBSERVATION

The codebase contains two mathematically incompatible conflict resolution paradigms with no bridge between them: CRDTs provide algebraically proven convergence (idempotent, commutative merge satisfying SEC requirements), while the multi-leader strategy pattern offers LWW and custom-merge without formal convergence guarantees, and no mechanism exists to compose or translate between them.

unbundled-catchup-produces-consistent-derived-state [OUT] OBSERVATION

Catch-up via snapshot and streaming produces consistent derived-system state equivalent to full event replay, because WAL entries are ordered by LSN and the rebuild protocol is verified to match full replay output.

wal-batch-replay-provides-atomicity [OUT] OBSERVATION

WAL batch replay correctly identifies and rejects incomplete batches via the trailing COMMIT sentinel, providing atomic-or-nothing batch recovery semantics.

wal-durability-effective-on-standards-compliant-fsync [OUT] OBSERVATION

The WAL's two-tier durability model provides effective crash protection: critical operations (checkpoints, batch commits, rotations) unconditionally force fsync while per-write sync respects the configured mode, creating a meaningful durability hierarchy where batch boundaries are always durable regardless of the performance/durability tradeoff chosen for individual writes.

wal-is-reliable-source-of-truth [OUT] OBSERVATION

The WAL serves as the authoritative source from which the entire storage engine can be rebuilt via replay.

wal-multi-segment-continuity-is-reliable [OUT] OBSERVATION

Multi-segment WAL replay provides reliable cross-segment continuity: segment rotation fsyncs the outgoing file before closing it, and EOF within one segment advances to the next rather than terminating replay, ensuring no inter-segment data loss during normal operation.

wal-sequence-numbers-enable-ordered-recovery [OUT] OBSERVATION

WAL sequence numbers provide infrastructure for ordered recovery: they are recovered from disk on init, checkpoints consume positions in the same monotonic space, and the total ordering could enable gap detection and ordered replay.

wal-truncation-safe-when-linear [OUT] OBSERVATION

WAL truncation maintains data safety through two complementary mechanisms: segments are processed in oldest-first order (so a crash mid-truncation leaves a contiguous suffix of segments), and the current WAL file is flushed and fsynced before scanning for records to remove (preventing buffered data from being lost).

Topics

wal (177)
btree (79)
bitcask (55)
lsm (55)
sstable (49)
compaction (46)
range (42)
hash (40)
index (40)
page (40)
event (36)
fsync (36)
scan (36)
uses (34)
read (32)
recovery (32)
write (32)
commit (31)
merge (31)
hint (30)