When any participant is unavailable, execute() returns "aborted" and no participant's state is modified, even for participants that were available and could have committed independently
The coordinator logs "committing" before any participant receives the commit decision, creating a window where a crash leaves participants locked indefinitely in the "prepared" state with no authority to self-resolve.
A participant that has voted YES in prepare() holds locks and cannot unilaterally commit or abort until it receives the coordinator's decision, creating an unbounded blocking window if the coordinator fails
Coordinator.recover() can only re-send decisions to participants that are currently available (is_available() check), leaving unavailable participants still in-doubt with no resolution path
A coordinator with a "committing" log entry will re-send commit decisions to all participants during recover(), ensuring that a crash after the commit decision but before delivery still completes the transaction
Two-phase commit has a systematic gap between its safety design and implementation: the coordinator has a known blocking window between logging and sending the decision, timeouts are accepted as a parameter but never enforced, and recovery requires all participants to be available — defeating the protocol's availability goals.
Participant.abort() only releases locks where self.locks.get(op["key"]) == tx_id, preventing one transaction's abort from releasing another transaction's lock
Locks acquired during prepare() are only released by commit() or abort(), meaning an in-doubt transaction blocks all subsequent transactions on the same keys indefinitely until the coordinator recovers
Locks held by a transaction are released regardless of whether the transaction commits or aborts, preventing deadlocks across sequential transactions
The coordinator appends "committing" or "aborting" to its log before broadcasting the decision to participants, enabling crash recovery via recover() to re-send decisions that were never delivered
Participant.recover() identifies transactions stuck in "prepared" state but provides no mechanism to resolve them without the coordinator, making it a diagnostic tool rather than a recovery procedure
Coordinator.recover() scans the log once and re-sends decisions to currently-available participants; it does not retry or schedule follow-ups, so a participant still down at recovery time stays locked until recover() is manually called again.
The 2PC protocol communicates all outcomes (commits, aborts, lock conflicts, unavailability) via {"outcome": "committed"|"aborted", "reason": ...} return dicts; no method raises exceptions
Two-phase commit's design-implementation gaps (known blocking window with no timeout enforcement, recovery requiring participant availability) are validated only under synchronous simulation where messages arrive instantly, meaning the real-world impact of coordinator crashes during the blocking window — where participants hold locks indefinitely — remains untested.
The timeout parameter is accepted by Coordinator._init but never referenced in any logic; availability is modeled via boolean available flags on participants instead of actual time-based timeouts
A transaction commits only if every participant in participant_operations votes "yes"; a single "no" vote triggers abort for all participants regardless of how many voted yes
Participants lock at key granularity during prepare(); a second transaction touching a locked key receives a "no" vote and aborts rather than waiting or queuing
Aborting a transaction in both MVCC and SSI implementations sets a status flag (_aborted set or status marker); no disk writes are reversed because uncommitted data never reached disk
A version created by an aborted transaction is invisible to every transaction, and a deletion by an aborted transaction is ignored — aborted transactions' effects are completely erased from all snapshots.
add_node mutates the ring after each vnode insertion, so subsequent vnodes in the same loop see predecessors from earlier iterations, preventing overlapping transfer arcs.
Each addnode call performs vnodecount list insertions into a sorted list, each O(total ring entries), making node addition O(V × N×V) in the worst case.
The oldowner != nodeid guard at consistent_hashing.py:38 skips transfer recording when a new vnode's successor is another vnode of the same node, avoiding double-counting already-claimed arcs.
advance_time(timestamp) lets an external orchestrator push the watermark forward without receiving an actual event, enabling idle-stream handling and explicit trigger semantics; it rejects timestamps at or below the current watermark
None of the 13 os.fsync() call sites have callers that depend on mtime or ctime metadata being accurate; all syncs exist purely for data durability, making every site a valid fdatasync candidate
The WAL, Bitcask, and B-tree storage engine all compute CRC over data payloads only, excluding their respective header/framing metadata fields — a consistent pattern across the repo
ALLTOALL topology delivers every pending change to every other node in a single sync() call, achieving convergence in one round when no custom-merge cascades occur.
allocate_page returns a page number containing whatever stale data was previously on disk; the caller must overwrite it with valid page content before the next WAL commit
allocate_page always recycles from the free list before extending the file with a new page, keeping the data file compact after deletions
The repair count returned by antientropyrepair() reflects individual node writes, not unique keys repaired; one stale key across N nodes counts as N repairs
Anti-entropy can precisely locate divergent key ranges via Merkle tree diffs but cannot fully reconcile them: tombstone semantics differ at every layer (empty-bytes sentinel in LSM, preserved-by-default in merge, replication-convergence-dependent in distributed), preventing consistent cross-replica resolution of deleted keys.
antientropyrepair() propagates existing VersionedValue entries without modifying versioncounters, so it never creates new versions — only copies the highest existing version to lagging replicas
Anti-entropy repair assumes a total ordering on versions where the highest version number is authoritative, which is safe because put() uses a global per-key version counter ensuring no two writes produce the same version
The gossip, merkle-tree, vector-clock, read-repair, and hinted-handoff modules are independent implementations with no cross-imports; composing them into a full Dynamo-style anti-entropy pipeline is left to the integrator
Both ReadRepairStore.antientropyrepair() and DynamoCluster.antientropyrepair() only scan and repair currently available replicas; a down replica must rely on a future anti-entropy run after it recovers.
Anti-entropy repair only reads from and writes to nodes where is_available is true; down nodes are excluded from both key discovery and repair propagation
antientropyrepair() collects keys from all available nodes via set union, so a key present on any single node is propagated to all other nodes regardless of whether it was intentionally removed elsewhere
macOS APFS provides implicit rename durability through its CoW transaction model, so the missing directory fsync is a latent bug that only surfaces on Linux filesystems (ext4, XFS)
If an exception occurs mid-batch in appendbatch (e.g., disk I/O failure in persistevent), already-appended events remain in events, _streams, and on disk with no rollback — the process continues with inconsistent in-memory state
Returned Event objects from appendbatch are the same instances stored in events; no defensive copy is made for either the event or its data dict, so caller mutation corrupts store state.
appendbatch checks expectedversion once before the write loop; it does not re-check after each event, so the guard is a pre-condition gate that rejects the entire batch cleanly on mismatch, not a per-event invariant
applybyzantine is applied exclusively to outgoing messages; incoming messages are never filtered through it, so a Byzantine node's internal state (prepared/committed sets) remains consistent even while it sends corrupted messages
applyremotechange always advances the node's Lamport clock to max(localclock, remotets) + 1 before any conflict resolution, ensuring subsequent local writes have causally-later timestamps than any received change.
applyremotechange does not validate that mergefn is non-None when strategy is CUSTOMMERGE; passing None raises TypeError at call time rather than a descriptive error.
With asyncindex=True, TermPartitionedDB queues index operations in pending instead of applying them immediately; flush_index() must be called to drain the queue, modeling asynchronous global index updates.
Consumer defaults to auto_commit=False (line 182), requiring explicit commit() calls; at-least-once semantics via auto-commit are opt-in, not the default behavior
Arrays and maps are encoded as a sequence of count-prefixed blocks terminated by a 0-count block; counts and elements use zigzag+varint via writelong/readlong
The AvroEncoder always writes positive counts for array/map blocks, never negative counts with byte-size prefixes, trading skip efficiency for encoder simplicity
Avro's binary encoding systematically prioritizes compactness over self-description: the format contains no type tags or field names (requiring both schemas for decoding), arrays use zero-terminated blocks for streaming, and negative block counts enable O(1) skipping — all optimizations that assume schema availability.
When a writer sends an enum symbol not in the reader's symbol list, the reader falls back to the default symbol declared in the reader schema rather than raising an error
The Avro int type enforces 32-bit signed integer range ([-2^31, 2^31-1]) at encode time via ValueError, while long accepts values at least as large as 2^40
During record resolution, a reader field absent from the writer schema must have a default value; missing defaults cause SchemaCompatibilityError, making defaults the mechanism for forward/backward compatibility.
A negative block count -N in Avro's spec means N elements follow, preceded by a byte-size long that enables O(1) skipping of the entire block without parsing elements
NODEFAULT is a sentinel object distinguishing "no default provided" from a None/null default; this matters because null is a valid Avro default value, and conflating the two would break canonical form computation and default-filling during resolution.
Avro binary encoding contains no type tags, field names, or schema metadata; decoding is impossible without the writer schema.
Type promotion during schema resolution is strictly one-directional (int→long→float→double, int→float, int→double, long→double); reverse promotion (e.g., long→int) is not supported and raises SchemaCompatibilityError.
Record fields are encoded and decoded in declaration order of the writer schema with no field names on the wire; reordering fields in the schema definition changes the binary format and breaks existing data.
SchemaRegistry prepends a 4-byte big-endian unsigned int schema ID (struct.pack('>I', id)) to encoded payloads, modeling Confluent's Schema Registry wire format; schema IDs are limited to [0, 2^32).
Schema("int") and Schema({"type": "int"}) are equal and hash-equal, enabling consistent set/dict usage regardless of whether the schema was constructed from a string or dict form
SchemaError is raised at parse time for structurally invalid definitions (a valid Schema object is guaranteed well-formed); SchemaCompatibilityError is raised during decode resolution or compatibility checking when two schemas cannot be reconciled.
The decoder's _skip method discards unwanted fields by recursively walking the writer schema rather than using byte-size jumps, which works correctly but requires understanding the element schema
When encoding a Python dict into a union containing both a record and a map type, the encoder resolves ambiguity by matching dict keys against record field names first, falling back to map only if no record matches
A union schema cannot contain two branches with the same type name (or record name); this constraint is enforced at schema parse time in Schema._parse.
AvroDecoder always requires both a writer schema (to parse the wire bytes) and a reader schema (to shape the output); this dual-schema resolution is Avro's core mechanism for schema evolution.
append_batch can leave a partial batch on disk if the process crashes mid-write, since events are individually appended to the NDJSON file without a transaction marker or write-ahead log
In batch sync mode, individual WAL appends do not fsync until writecount reaches batchsync_count (default 100), leaving up to 99 records vulnerable to crash loss
append_batch() stores all events in the batch first, then notifies subscribers of each event sequentially, ensuring subscribers see a consistent state where all batch events exist
The WAL writecount counter only resets when the batch threshold triggers a sync (line 133), not when a forced sync occurs, which may cause counter drift between forced and threshold-triggered syncs
The entire storage stack uses rigid binary formats that preclude both forward evolution and post-corruption recovery: WAL records are contiguously packed with no block alignment or version negotiation preventing resync after mid-file corruption, and SSTables lack per-entry checksums and efficient skip structures — neither layer can be upgraded in place or self-repaired after partial damage.
When the number of frozen segments exceeds autocompactthreshold (default 5), compaction is triggered automatically during put() operations
compact() re-serializes records directly to the output file rather than calling writerecord, duplicating the binary serialization logic in a separate code path
Hash-index-storage compaction is non-atomic with no error handling: a crash between deleting old files and completing the rewrite can leave the store in an unrecoverable state with no rollback mechanism.
Compaction rewrites records with their original timestamps (not the time of compaction), preserving timestamp-based ordering across merges.
During compaction, keys whose keydir entry points to the active file are excluded from merge output; their immutable-file copies are treated as stale because the active file holds a newer value.
Compaction produces .hint files alongside each merged data file (including mid-compaction when a merged file hits maxfilesize), enabling O(keys) index rebuilds instead of O(records) full scans on next startup.
Compaction performs delete-then-rename without journaling; a crash mid-compaction can leave the store in a state that _recover() cannot reconstruct correctly
After compact(), every key returns the same value as before compaction and deleted keys remain absent; compaction changes only physical layout, never logical state.
After compaction, merged files get IDs above the old active file, and the active file is renamed to an even higher ID to maintain the monotonically increasing file ID invariant.
BitcaskStore can rebuild its in-memory index by scanning .data files alone when .hint files are missing, producing identical read results to a clean startup with hint files present.
CRC32 is computed over keybytes + valuebytes only; the 12-byte header (which contains the CRC itself) is excluded from the checksum input
In the log-structured hash table, CRC32 validation occurs only during get() and scansegment(); writes compute and store the CRC but never verify it back, so a corrupted write is not caught until read time
Reading a record whose payload does not match its CRC32 header raises CorruptionError — the store never silently returns corrupt data
Calling delete() on a key that doesn't exist in the store completes without error.
File rotation triggered by maxfilesize is invisible to readers — all keys remain retrievable across multiple .data files because the keydir maps each key to its specific (file_id, offset).
syncwrites defaults to True, meaning every write_record call triggers an fsync — durable by default at significant write throughput cost.
BitcaskStore.get() returns None for missing or deleted keys, never raises KeyError.
If a segment file is deleted out-of-band or by a crash during compaction, get() raises an unhandled FileNotFoundError rather than returning None or a graceful error
get() opens a new file handle per call rather than using the filehandles cache, making the handle cache effectively unused for reads
get() performs exactly one disk seek per call; the key-to-offset mapping is resolved entirely in memory via self._index, making reads O(1) index lookup plus one disk read
The log-structured hash table's HEADERFMT = "!III" produces a fixed 12-byte header (three big-endian uint32s: CRC, keysize, value_size) at log-structured-hash-table/bitcask.py:10-11; total record size is always 12 + len(key) + len(value)
createhintfiles() skips tombstone records and only emits entries whose index currently points to that segment+offset, so hint-based recovery cannot distinguish "key was deleted" from "key never existed in this segment"
When a .hint file exists for a file ID, rebuildindex loads it instead of scanning the data file; hint files have no checksum validation, so they must be written atomically with their data files during compaction.
The in-memory keydir dict is the only index; every live key has exactly one KeyEntry pointing to its most recent non-tombstone record on disk.
Records have no CRC or integrity checksum; the only corruption guard is the assert read_key == key in get(), which catches index/data mismatches but not bit-rot.
Neither Bitcask implementation uses memory-mapped I/O; both use standard open()/read() with manual buffering, forgoing kernel page cache optimizations that production Bitcask implementations rely on for startup performance
Both Bitcask implementations process data/hint files strictly sequentially during keydir rebuild with no threading, multiprocessing, or async I/O, even though hint file loading is embarrassingly parallel
Neither Bitcask implementation has any range query, ordered iteration, or prefix scan capability; the hash-based keydir only supports exact-key point lookups
If a file matching segment*.dat contains non-numeric characters in the ID position, findexistingsegments crashes with ValueError; no validation guards against malformed filenames
On startup recovery, the store skips incomplete records at segment tail (header present but payload truncated) without raising errors or losing previously committed data
File handles in self.filehandles are shared and mutable (via seek), so concurrent calls to readrecord on the same fileid would race.
rebuildindex sorts file IDs ascending before scanning so that newer records overwrite older ones in the keydir, enforcing last-write-wins semantics.
readrecord and writerecord share the same binary layout (<dII header + key bytes + value bytes); a change to either must be mirrored in the other.
Recovery scans segments in ascending ID order so that newer writes overwrite older index entries, and the highest-ID segment becomes the active segment for appending
The os.rename calls in both Bitcask implementations rename active segments to frozen segments for namespace management, not for crash-safe atomic file creation — the file is already populated at its original path before the rename
File rotation is checked before each write, so the active file can exceed maxfilesize by up to one record's worth of bytes; the limit is a soft cap, not a hard one
When the active file is rotated, its read handle in file_handles is preserved so existing KeyEntry references pointing to the old file remain valid
In scandata_file, a truncated header safely breaks the scan loop, but a short key/value read after a valid header is not detected — the truncated data is silently used, potentially corrupting the keydir.
findexisting_segments uses os.listdir as the source of truth for which segments exist; there is no persistent manifest file — the filesystem is the manifest
The filename parsing in findexistingsegments (prefix/suffix slicing to extract segment ID) must stay in sync with the format string in segment_path; no shared constant enforces this coupling
File IDs are assigned by incrementing activefileid by 1 with no gap detection or collision check; compaction updates activefileid to avoid collisions after renumbering
Exactly one data file is open for writes at any time; all other files are immutable and read-only.
All tests in testbitcask.py pass syncwrites=False, meaning the durable fsync-per-write code path is untested.
scandatafile removes tombstoned keys from keydir during rebuild, but loadhintfile unconditionally inserts entries — it relies on compaction having already stripped tombstones before writing the hint file.
get() never encounters tombstone records because delete() removes the key from self._index; tombstone handling is solely a recovery and compaction concern
Deletion is represented by writing a record with an empty-string value; both scandatafile (checks valsize == 0) and get (checks value == "") use this convention.
BitcaskStore.delete() writes a tombstone record (b"_BITCASKTOMBSTONE__"); get() returns None for tombstoned keys, and compact() removes both the original entry and the tombstone
writerecord only appends to disk and returns a byte offset; it never modifies self._index, leaving index management entirely to callers (put, delete, compact)
BloomFilter and CountingBloomFilter use the identical _hashes() double-hashing scheme (MD5-based, lines 9-14), producing the same bit/counter positions for identical parameters — so a BloomFilter and CountingBloomFilter with the same m and k will agree on membership queries for the same insertions.
For SSTable compaction, rebuilding a fresh BloomFilter during the merge write pass is simpler and more correct than incrementally patching a CountingBloomFilter with remove(); compaction already pays O(N) I/O so filter construction adds only constant-factor overhead
_hashes() uses Kirschner-Mitzenmacher double hashing from a single MD5 digest, producing k positions as (h1 + i*h2) % m without k independent hash functions
The Bloom filter hash function is deterministic (not seeded with randomness), so two filters with identical parameters produce identical bit arrays for identical inputs — enabling equality comparison and reproducible serialization
BloomFilter generates k bit positions via Kirschner-Mitzenmacker double hashing: one MD5 digest is split into two 64-bit values h1 and h2, then positions are computed as (h1 + i*h2) % m rather than using k independent hash functions.
BloomFilter._len_ counts total add() calls, not distinct items — adding the same item twice yields len() == 2, unlike CountingBloomFilter which has the same property tracked separately
Bits in BloomFilter are only ever set, never cleared; this monotonic property makes the structure append-only, enables union via bitwise OR, and prevents deletion without a counting layer.
After BloomFilter.add(x), x in filter always returns True; the data structure guarantees zero false negatives by design (all k bit positions are set on add and all must be set for membership).
Neither LSM tree implementation (lsm.py or sstable.py) references or uses the bloom filter module; they exist as independent DDIA concept demonstrations with zero cross-module imports
BloomFilter computes bitcount as ceil(-n * ln(p) / ln(2)^2) and hashcount as round((m/n) * ln(2)), matching the standard optimal formulas from Bloom filter literature
BloomFilter.union() sets count = self.count + other.count, which overstates distinct items when filters share elements; estimatecount() from bit density is more accurate for unioned filters
BloomFilter.union() raises ValueError if the two filters differ in bit array size (m) or hash count (k), since bitwise OR is only meaningful when bit positions map to the same hash space.
_hashes does not guard against m=0, which causes an unhandled ZeroDivisionError in the (h1 + i * h2) % m computation
The k positions returned by _hashes are not guaranteed distinct — duplicate positions can occur when m is small relative to k, and all callers (add, contains, remove) tolerate this via idempotent bit/counter operations
_hashes splits the 128-bit MD5 digest at the midpoint into two 64-bit little-endian unsigned integers h1 and h2, used as base position and step size for double hashing
BloomFilter.to_bytes packs a 12-byte header (m, k, count as three little-endian uint32s) followed by the raw bit array, matching the pattern used by SSTable footers for sparse indices
Both storage paradigms in the reference implementations exhibit fundamental scalability constraints: the hash index requires all keys in RAM (making dataset size directly bound by available memory with no spill-to-disk fallback), while the LSM tree scans every SSTable on negative lookups because the correctly-implemented Bloom filter module is never wired into the read path.
Both testertest*.py and test*.py naming conventions match pytest's default collection patterns (test*.py and *test*.py), so pytest in any project directory runs the combined suite — note: this contradicts existing belief tester-naming-avoids-pytest-discovery; one should be retracted
BroadcastHashJoin loads the entire small dataset into a hash table at construction time via buildhash_table, so the small side must fit in a single mapper's memory
BroadcastHashJoin receives the small dataset at construction time and builds a hash table; the large dataset is streamed through .join(), reflecting the asymmetric API where the small side must be available upfront
Page allocation during splits modifies metadata through PageManager.write_meta (not WAL-logged), creating a potential crash-safety gap if the process fails between allocation and the subsequent WAL commit
Leaf lookups use bisectleft to find the leftmost insertion point, then check keys[idx] == key for exact match; bisectright would return the position after all equal keys, overshooting the target.
search uses bisectleft at leaf nodes for exact-match lookup and bisect_right at internal nodes for child routing; this asymmetry matches the split convention where the median key is copied up and retained as the first key of the right leaf.
pm.resetcounters() is called at the start of each public method, so stats.pagesread/written reflects only the most recent operation.
Corrupted WAL entries (CRC mismatch) are silently skipped during B-tree recovery rather than raising an error or halting replay.
B-tree deletion is structurally degrading: empty leaf pages are never freed, freed pages leave dangling parent pointers, and tree height only increases — causing monotonic space amplification and search-path lengthening over delete-heavy workloads.
Empty leaf cleanup only triggers when the current internal node is at depth 2 (direct parent of leaves) and the empty child is not the leftmost; empty leaves at depth > 2 or at index 0 are silently left in place
Deletes decrement total_keys immediately and the key becomes unreadable at once; no tombstone entries are written and no deferred compaction pass is needed
BTree.delete removes keys from leaves but never calls freepage on empty leaves, causing unbounded page file growth despite PageManager having a free list mechanism.
delete re-reads metadata after delete returns to pick up nextfree/freehead changes made by pm.freepage, rather than threading updated values through the recursive call stack.
When the key is not found, delete writes no WAL entries and performs no commit, making failed deletes purely read-only operations.
The delete path does not call freepage even when a leaf becomes completely empty; PageManager's free list is used only by allocatepage after explicit free_page calls from other paths.
_delete returns False (not found), True (deleted, no cleanup needed), or the string 'empty' (leaf now empty, parent should attempt cleanup); the public delete() collapses this to bool
_search trusts that the depth parameter from metadata accurately reflects the tree's balance; corrupted height causes silent wrong-format deserialization (leaf data parsed as internal or vice versa) with no error.
Both deserializeleaf and deserializeinternal perform no validation of page type, sort order, or buffer length; they trust that WAL CRC checks and the serialize functions guarantee well-formed pages, so corruption produces silent wrong results rather than errors.
B-tree mutations pay for os.fsync() twice: once when writing the WAL entry (btree.py:137) and again when committing the page to the data file (btree.py:105), with the WAL truncated only after the data file sync confirms durability.
The B-tree's durability model protects user data but not structural integrity: mutations pay double fsync for data pages (WAL entry + data write) while structural metadata is never fsynced, AND structural integrity silently erodes during normal operation (leaked pages, ever-growing height, dangling parent pointers after free_page) with no defensive checks in the I/O layer to detect or prevent degradation.
When all keys are removed from a non-root leaf, the page is returned to the free list via PageManager.free_page() rather than persisting as an empty node
The B-tree data file only grows (via nextfree bump in allocatepage) and never shrinks; no truncate, compact, or defrag operation exists anywhere in the module.
All six bugs documented in fix-plan.md concern single-writer WAL/fsync crash safety; none involve concurrent access races, illustrating that even without concurrency the WAL protocol has subtle correctness requirements
Freed pages store the "next free" pointer in their own body at HEADER_SIZE offset, forming an intrusive singly-linked list with no separate allocator data structure.
pm.free_page() called during deletion writes directly to the data file without WAL logging, creating a crash-consistency gap if the process fails between the free and the WAL commit
PageManager.free_page adds a page to the free list without removing the parent's downlink to it, creating a dangling-pointer risk on crash
The B-tree's PageManager tracks pagesread and pageswritten with a resetcounters() method (multiply by pagesize for bytes), but WAL writes are not counted by any existing counter
Since delete never merges internal nodes or demotes the root, tree height only increases (on splits) and never decreases, regardless of how many keys are deleted.
The reference implementation fully propagates splits upward through insert's return value but swallows the 'empty' signal in delete after one level of cleanup, making insertion structurally complete but deletion structurally incomplete
_insert never calls wal.commit(); the caller (put) is solely responsible for committing after all page writes and metadata updates complete, making the entire put atomic with respect to crash recovery
Internal node splits promote the midpoint key into the parent (removing it from the child), unlike leaf splits which copy it (the key remains in the right child); this asymmetry allows uniform bisect_right routing at all internal levels.
Leaf deserialization always decodes keys as UTF-8 strings while values remain raw bytes; invalid UTF-8 in key bytes causes UnicodeDecodeError at read time, not write time (since serialize accepts both str and bytes).
Leaves store a nextsibling pointer forming a left-to-right chain terminated by NOSIBLING (0xFFFFFFFF), enabling range scans without touching internal nodes.
Leaf pages store a nextsibling forward pointer (NOSIBLING = 0xFFFFFFFF sentinel) but no backward pointer, so unlinking a leaf from the sibling chain requires locating the predecessor via parent traversal rather than direct backlink.
When an empty leaf is removed during deletion, delete patches the previous sibling's nextsibling pointer to splice the empty leaf out of the linked list used by rangescan and iter_
Leaf splits copy the midpoint key into both the parent and the right leaf (key remains searchable in the leaf), while internal splits promote the midpoint key out of both children into the parent only
A single BTree.get() reads at most height + 1 disk pages, verified via pm.pages_read after counter reset.
With maxkeysper_page=4, inserting a 5th key into a leaf triggers a page split; tests use this small fanout to exercise multi-level tree structure with few keys.
BTree.writemeta writes root pointer and free list head directly to the data file without logging through the WAL, making metadata updates not crash-safe.
B-tree mutations pay double fsync costs for user data (WAL entry + data page) but skip fsync entirely for structural metadata, creating an asymmetry where key-value pairs survive crashes but the free-page list and allocation state may not.
Delete does not rebalance underfilled nodes; the only cleanup is freeing empty non-leftmost leaves at depth 2 — a deliberate simplification over production B-trees.
The B-tree's fixed-size pages (pagenum * pagesize addressing) provide natural resync boundaries for the data file — corruption of one page does not affect reads of other pages, unlike the streaming WAL formats
The B-tree's page I/O layer lacks defensive validation at every stage: deserialization accepts any bytes, leaf-finding doesn't verify node types, and page writes silently truncate oversized data — a single corrupted byte can cascade through the tree undetected.
B-tree PageManager overwrites pages at fixed offsets within a pre-allocated file, so most write+sync cycles do not change file size and would benefit from fdatasync skipping metadata I/O
PageManager has no application-level page cache — every readpage reads from the file and every writepage writes immediately, relying entirely on the kernel page cache for performance
Page 0 is always the metadata page (rootpage, height, totalkeys, nextfreepage, freelisthead); the root page is never page 0.
If a crash occurs after the right page and left page are WAL-logged but before the parent update, recovery produces a consistent sibling chain but the right page is unreachable from tree descent — range scans find the keys but point lookups silently miss them
BTree.put() on an existing key updates the value in-place without incrementing len(tree) — it is an upsert, not a blind insert.
put must re-read metadata after insert returns because PageManager.allocatepage mutates nextfree and freehead as a side effect during splits, and those changes are not threaded back through the return value.
insert returns a discriminated union: None (updated existing), 'inserted' (new key, no split), or (midkey, new_page) (split happened); root splits are handled in put(), not inside the recursion.
The next_sibling field sits at a fixed byte offset (bytes 3–6) in every serialized leaf page, making the sibling chain walkable without fully deserializing page contents
The B-tree sibling chain is actively maintained during all structural mutations with a fixed wire format: splits write the new right page with the old leaf's nextsibling pointer before rewriting the old leaf, deletions patch the previous sibling's pointer to splice out the removed leaf, and the nextsibling field sits at a fixed byte offset (bytes 3-6) enabling reliable traversal without full page deserialization.
Leaf pages serialize a nextsibling pointer (via serializeleaf and NOSIBLING sentinel) providing infrastructure for merge/redistribute, but no code path reads sibling pointers during deletion.
The B-tree's PageManager writes to a single pre-existing data file opened at construction, so normal operations never create new files and avoid the directory fsync gap that affects segment-based engines during rotation and compaction.
The B-tree implementation has no locking, latching, or concurrency control; all crash-safety gaps exist even without concurrent access
put (and delete) assume single-threaded access: metadata is re-read after insert/delete without any locking or compare-and-swap, creating a TOCTOU window under concurrent writers.
During a leaf split, _insert writes the new right page to the WAL before the modified left page, ensuring the sibling pointer target is durable before any pointer references it — so the chain never points to a nonexistent page after crash recovery
The B-tree module uses only stdlib imports (os, struct, zlib, bisect) with zero external dependencies.
B-tree structural integrity silently erodes over time: deletions cause monotonic degradation (leaked pages, ever-growing height) while the I/O layer performs no validation to detect or report the resulting inconsistencies.
Data written before BTree.close() is fully recoverable by constructing a new BTree on the same directory, including updates and deletes.
B-tree sequential access is strictly forward-only: iteration descends the left spine (following children at index zero at every internal level) to reach the leftmost leaf, then walks the forward-only sibling chain, with no backward pointer, reverse iterator, or random leaf access mechanism.
In-node key lookup uses bisectleft/bisectright (binary search), making within-node search O(log B) rather than the textbook O(B) linear scan — shifting the optimal branching factor higher than classical B-tree analysis predicts.
Every page mutation is logged to the WAL (with fsync) before being written to the data file; recovery replays the WAL, so no committed write is lost on crash.
In b-tree-storage-engine/btree.py, the WAL entry checksum covers only pagedata, so a corrupted pagenum during recovery writes a valid page to the wrong disk location without detection — the most dangerous integrity gap in the codebase
Each B-tree WAL entry is 16 + pagesize bytes (12-byte header with seq/pagenum/data_len, full page image, 4-byte CRC32), making WAL writes the dominant cost for small key-value updates
Each WAL.log_write() call performs an fsync, guaranteeing that WAL entries are durable before any data page writes proceed
The B-tree WAL in btree.py stores full after-image page data (physical WAL) and supports only redo recovery; there is no undo capability
PageManager.write_page calls flush() but never os.fsync(), so written pages may not be durable on crash even after the WAL has committed.
The WAL logs individual page writes and commits by full truncation; it cannot represent a multi-step structural operation as an atomic unit
Multi-page operations (splits, deletes that free pages) are made crash-safe by writing all page modifications to the WAL before applying them to the data file
Because the B-tree WAL stores complete page images, replaying entries already applied to the data file produces identical results, making the crash window after sync() but before truncate() harmless
The B-tree's WAL replays all valid (CRC-passing) entries on recovery regardless of whether a commit marker is present — uncommitted entries are applied without distinction.
The B-tree WAL commit follows a strict three-phase protocol: (1) write WAL entries + fsync, (2) fsync data file via page_manager.sync(), (3) truncate WAL + fsync WAL fd — crash-safe at every interleaving because WAL entries are physical page images making replay idempotent.
WAL.commit() truncates the WAL before the data file is fsynced, creating a crash window where both the log and the data can be lost.
The B-tree's WAL commit() only truncates after fsyncing the data file, making truncation idempotent — a crash mid-truncation simply replays the WAL entries again on recovery
The WAL truncation in commit() is the atomic linearization point: before truncation, recovery will redo the transaction; after truncation, the transaction is committed
The WAL logs complete page images (physical logging) rather than logical operations, which eliminates incomplete-split states during recovery but increases write amplification compared to PostgreSQL's logical WAL entries.
The B-tree WAL places its CRC32 checksum as a trailer after page_data, while the standalone WAL and Bitcask embed CRC in the record header — a structural difference that affects torn-write detection behavior
Join buffer size is bounded by events within window.duration of the current watermark; expireevents garbage-collects everything below that cutoff on every event arrival or advance_time call
The Bully Algorithm requires any node to message any other node directly via allnodeids; there is no ring or tree overlay topology
Candidates wait electiontimeout // 2 for ALIVE responses before deciding, shorter than the follower's full electiontimeout, to avoid cascading election delays.
Receiving an ELECTION from a lower-ID node triggers both an ALIVE response and a new start_election call, creating the O(n²) message cascade in worst-case elections
declare_victory sends a COORDINATOR message to every other node in the cluster (O(n) messages), not just to nodes that participated in the election
Election terms in getelectionhistory() are enforced to be monotonically non-decreasing across successive elections, validated by testtermsincrease_monotonically
The Bully Algorithm always elects the highest available node ID as leader; startelection only sends ELECTION to higher-ID nodes, and declarevictory only fires when no higher node responds.
When a failed node recovers via recover_node, it immediately starts an election, potentially preempting the current leader — this is the defining "bully" behavior.
Split-brain is detected and resolved by resolvesplit_brain in the cluster harness after each tick (forcing lower-ID leaders to re-elect), not prevented by the node-level protocol alone.
All messages generated within a single tick() call are delivered and fully resolved (including cascading responses) in a while all_messages loop before the tick returns.
Leader election tests control time via an explicit starttime parameter to rununtil_leader() rather than real timers or sleeps, making the entire test suite deterministic with no flaky timing
The byzantinemode field is set in init_ and never modified during protocol execution, so a node's fault behavior cannot change mid-protocol — simplifying reasoning about safety guarantees
LiveProjection implements the catch-up subscription pattern: replay historical events via readall(fromposition=...), then switch to push-based updates for future events, bridging pull/push at a known position with no gap
catchup() calls readall(fromposition=self.position + 1) to skip already-processed events, but onevent has no equivalent position guard — an asymmetry that enables duplicate processing
Projection.catchup() calls readall(fromposition=self.position + 1), so loading a snapshot sets the exact boundary — only events after the snapshot's position are replayed, making reconstruction cost proportional to events-since-snapshot
CountingBloomFilter uses one byte per counter position (bytearray(self._m)) versus one bit per position in BloomFilter, an 8x memory overhead for deletion support
CountingBloomFilter._len_ returns add calls minus remove calls, not distinct element count; adding the same item twice and removing once leaves len == 1
Each CountingBloomFilter counter uses a full byte in a bytearray regardless of counter_bits, using 8x more memory than a bit-packed representation
Unlike standard Bloom filters which never have false negatives, CountingBloomFilter.remove() can cause false negatives when two items share a hash position — removing one decrements the shared counter, potentially making the other appear absent
CountingBloomFilter.remove can silently succeed for an item never added if all its hash positions have non-zero counters from other items, corrupting the filter state
CountingBloomFilter.remove() raises ValueError if any hash position has a zero counter, providing a safety check against removing items that were never added
CountingBloomFilter.remove() validates all hash positions have non-zero counters in a first pass before decrementing any in a second pass, preventing partial state corruption when removing an absent item
When a CountingBloomFilter counter reaches maxval (default 15), it is never decremented, preventing false negatives at the cost of a slight false positive rate increase
When a CountingBloomFilter counter reaches maxval (default 15 for 4-bit counters), it is permanently frozen: neither add nor remove will change it
CDCConsumer/MaterializedView in cdc.py and Projection in event_store.py implement the same structural pattern — track position, pull new events, apply handlers — against different event sources (database mutations vs. domain events)
The CDC backbone that all derived systems depend on has two independent reliability gaps: event type semantics are determined by reconstruction heuristics rather than explicit markers (insert vs update distinguished by old_value presence, tombstones reported as None, snapshots use sentinel sequence numbers), AND the consistency requirements of derived systems (explicit flush to make mutations visible, old values for index maintenance) aren't reliably met by the CDC infrastructure.
Every ChangeEvent follows a strict before/after convention: INSERT has before=None, DELETE has after=None, UPDATE has both populated as full row copies, making events self-contained without querying the source database.
If a CDC consumer crashes mid-poll(), the position is not advanced, so all events from that batch will be reprocessed on the next call — providing at-least-once delivery with no deduplication.
A CDC consumer's position tracks the *next* sequence number to read (not the last one read); after processing, it advances to lastevent.sequence_number + 1.
CDCConsumer tracks its read position as an in-memory integer (_position) with no durable offset storage, meaning position is lost on process restart
The CDC layer distinguishes insert from update events by checking whether old_value is None; the WAL only knows PUT and DELETE, so the semantic enrichment happens at the CDC boundary
CDC event semantics are determined by reconstruction heuristics rather than explicit markers: insert vs update is distinguished by checking whether old_value is None, tombstones are reported as None values indistinguishable from missing data in conflict records, and snapshots use a sentinel sequence number (-1) to distinguish bootstrapped state from real events.
Insert vs. update is determined solely by whether CDCStream.emit() receives a non-None old_value; the WAL itself only records PUT operations for both cases
CDCLog.compact() retains only the most recent event per (table, key) pair, matching Kafka log compaction semantics but without concurrent-reader safety
CDCLog.compact() preserves the original sequence numbers of surviving events; it does not renumber them, so consumer positions remain valid references after compaction.
When a MaterializedView's transform function returns None for an INSERT or UPDATE event, the row is removed from the view rather than stored, acting as a filter on the derived table.
SecondaryIndex.processevent depends on CDCEvent.oldvalue to remove stale index entries during updates; without before-images, incremental index maintenance produces phantom references
SearchIndex removes all old tokens and re-adds all new tokens on every UPDATE event, even if only non-indexed columns changed — correct but not optimized for partial changes.
createsnapshot assigns sequencenumber=-1 to all synthetic INSERT events, distinguishing bootstrapped state from real log entries and enabling new consumers to load current state then seek to the live tail.
Every CDCDatabase mutation (insert/update/delete) writes to in-memory state and appends to CDCLog in the same synchronous call with no failure boundary between them
The consistent hash ring's hash function truncates MD5's 128-bit output to 32 bits via & 0xFFFFFFFF, matching the RINGSIZE of 2^32
getnodes walks clockwise past virtual nodes belonging to already-seen physical nodes, ensuring exactly replicationfactor distinct physical nodes in the preference list
ConsistentHashRing.ringpositions is maintained in sorted order at all times via bisect; all key lookups depend on this invariant
addnode and removenode return transfer map descriptions of which arcs move between nodes but do not move data themselves; the caller must act on the map
A consistent hash ring node with weight w gets int(num_vnodes * w) virtual nodes, so weight=0.5 produces half the default ring presence
Chained frame checksums (as in SQLite's WAL) prevent mid-log splicing, deletion, and reordering but require sequential validation from the chain start, eliminating the ability to validate or replay a single record in isolation
Checkpoint records are encoded with zero-length key and value fields; the record's semantic role is carried entirely by its OP_CHECKPOINT type and sequence number
replay() filters out CHECKPOINT records (along with COMMIT records), returning only PUT and DELETE — checkpoint records serve as truncation boundaries, not data operations visible to recovery consumers
OPCHECKPOINT records are written to the WAL but recovery (recoverseqnum) does not search for them; the caller must supply the checkpoint sequence number externally via replay(after_seq=)
The WAL supports OP_CHECKPOINT records that could serve as a durable truncation watermark (skip records below the watermark during replay), but truncate() physically removes records instead of using this mechanism
The & 0xFFFFFFFF mask in WAL._checksum is a Python 2 portability idiom — Python 3's zlib.crc32 already returns unsigned values, making the mask technically redundant but kept as defensive code
The Client.heldtokens dict is never cleared on lock expiry — the client retains and may use a token whose corresponding lock has already been acquired by another client
The code-expert workflow is read-only against the target repository — it scans, explores, and extracts beliefs but never generates or modifies source files
All implementations use NO-STEAL buffer management: uncommitted transaction data is never written to the persistent data store, eliminating the need for undo logging at the cost of requiring all dirty data from active transactions to fit in memory
mapoutputrecords is incremented before the combiner runs (line 108), so stats reflect pre-combiner volume; the test at line 68 asserts this explicitly
The combiner's callable signature matches the reducer's, but no runtime or static check verifies associativity or commutativity — a non-associative combiner (e.g., average) produces silently wrong results that vary with num_mappers
The reducer reads combiner output from intermediate JSON files identically to raw mapper output, with no metadata distinguishing pre-aggregated values from raw emitted pairs
On write-write conflict, commit() calls abort(tx) internally and returns False, so callers never need to manually abort after a failed commit; the transaction is dead after a False return.
The conflict-check-then-commit sequence in MVCCDatabase.commit() is not atomic — no locking protects the window between checking versions and marking committed, requiring single-threaded execution for correctness.
The write-write conflict check in commit() scans all versions of each key in the transaction's write set (not just the latest), making conflict detection O(versions) per written key.
OPCOMMIT is written exclusively by appendbatch() in write-ahead-log/wal.py; single append() calls never produce a COMMIT record, so individual operations are their own atomic unit
Read-only transactions bypass write-write conflict detection entirely in commit() and always commit successfully, since they cannot produce write-write conflicts.
Commit timestamps are assigned by commit() and stored in committimestamps, but isvisible never references them — visibility is determined purely by transaction IDs and the committed/active-at-start sets.
In hash-index-storage/bitcask.py, cached file readers for old immutable files are closed at the start of the merge-write phase, before all keydir entries have been updated to point to new locations, creating a window where reads would fail even without concurrent access
compact() removes old SSTable files from disk after merging, meaning any concurrent reader holding a reference to a deleted SSTable reads stale data (on Unix) or crashes (on Windows)
During log-structured-hash-table compaction, records in frozen segments are only kept if _index still points to that exact (path, offset), correctly excluding keys that were overwritten in the active segment after the frozen segment was closed.
During compaction, the entry with the highest SSTable sequence number wins for each key; sequence numbers are per-SSTable (all entries in one SSTable share the same seq), not per-entry
compact() mutates self.sstables without locking; concurrent flush or get calls during compaction can produce incorrect state or lose newly flushed SSTables
Log-structured-hash-table compaction is non-atomic: the sequence of write-new, update-index, delete-old, rename-active has no transaction boundary, and a crash mid-compaction can leave orphaned segments or a misnamed active file.
Tombstones (b"") are permanently removed during compaction and never written to the output SSTable; deleted keys disappear entirely after merge
Topic.compactpartition overwrites base_offsets[partition] with the first surviving message's offset (absolute assignment), discarding whatever the previous base was, because compaction removes messages from arbitrary positions.
Log-structured-hash-table compaction has no synchronization protecting the index, file handles, or segment counter during its multi-step mutation sequence; concurrent access during compaction would corrupt state.
Log-structured-hash-table compaction reads records from frozen segments without verifying CRC checksums, unlike scansegment which stops at the first CRC mismatch; a corrupted record could be silently copied into the compacted segment.
lsm.py:compact collects all merged entries into an in-memory list before writing the output SSTable, requiring O(n) memory proportional to total data rather than O(k) proportional to number of input SSTables
A crash during LSM compaction after tombstones are stripped from the merged output but before old SSTables are deleted can resurrect previously-deleted keys, because the tombstone that suppressed them no longer exists in any surviving SSTable.
compact() deletes old SSTable files immediately after replacing self._sstables, with no mechanism to defer deletion until active readers holding references to the old list finish their iterations
Compaction is the highest-risk operation yet operates within a durability pipeline broken at both ends: crash-unsafe compaction can permanently lose data or resurrect deletes, AND the surrounding infrastructure has fsync gaps (data may never reach disk) and unrecoverable integrity checks (corruption detected but not repaired) — the most dangerous operation has the least protection.
Compaction is the critical junction where crash safety and data lifecycle intersect: crash-unsafe compaction can permanently lose data or resurrect deleted keys (no write-temp/fsync/rename, delete-before-rename ordering), and fragmented tombstone semantics mean those corruptions propagate inconsistently through derived systems that require flush and old-value tracking for correctness.
Both the LSM and SSTable modules trigger compaction via explicit synchronous method calls (compact() / run_compaction()), not background threads — removing the write-amplification pressure that motivates least-overlap selection in production systems.
Compaction is unsalvageable as designed: it is the highest-risk operation within a durability pipeline broken at both ends (crash-unsafe writes that are unverifiable by testing), AND it triggers two independent failure modes under the concurrent access that production workloads produce (concurrent readers see inconsistent state, concurrent writers corrupt shared data structures), making it simultaneously the most critical and most dangerous operation.
No storage engine implementation uses the write-temp/fsync/rename pattern for file creation, and both Bitcask implementations delete old segments before renaming replacements, creating crash windows where data exists in neither old nor new files.
CompactionManager removes compacted SSTables from the in-memory _sstables list but never deletes their underlying files on disk; cleanup is the caller's responsibility.
The CompactionManager in sstable-and-compaction/sstable.py always calls mergesstables with the default removetombstones=False, making it conservatively correct but causing unbounded tombstone accumulation
CompactionManager implements both 'size_tiered' and 'leveled' compaction strategies, selected by a string parameter.
LSMTree.compact() performs a multi-step file swap (write new SSTable, update in-memory list, delete old files) with no mechanism to make the transition atomic across crashes
After compact_partition, surviving messages retain their original offset values, creating non-contiguous gaps in the offset sequence; offsets are never renumbered.
hash-index-storage/bitcask.py:compact copies records without integrity validation, so silently corrupted data survives compaction into new files
Bitcask compaction output is itself subject to maxfilesize rotation, so smaller size limits produce more post-compaction files — feeding back into recovery cost.
CompactionManager.run_compaction with leveled strategy assigns merged output to level 1 but does not verify that the new file's key range is disjoint from existing L1 files
The sstable.py CompactionManager separates *which* SSTables to merge (size-tiered vs. leveled strategy) from *when* to merge, but neither the manager nor any caller implements scheduling, rate-limiting, or load-aware deferral logic.
The compaction_threshold parameter (lsm.py:204) directly controls how many overlapping SSTables can accumulate before compaction, setting the worst-case missing-key probe count to threshold - 1
LSM compaction runs automatically when len(self.sstables) >= self.compactionthreshold (default 4), triggered at the end of flush after the new SSTable is registered
VectorClock.compare returns exactly one of BEFORE, AFTER, EQUAL, or CONCURRENT, implementing the partial order defined by component-wise comparison.
The expectedversion optimistic concurrency check in appendbatch runs before any state mutation, so a ConcurrencyConflict exception is safe and leaves the store completely unchanged.
Concurrency safety is absent from both mutation and query paths: core components (B-tree, garbage collector, consistent hash ring) silently assume single-threaded write access with no locks or assertions, and range scans lack snapshot isolation, cycle guards, or concurrent modification protection, meaning concurrent workloads can corrupt state through both writes and reads independently.
Compaction under concurrent access is unsafe in two independent failure modes: concurrent readers and writers have no synchronization (no locks, latches, or snapshots on either path), AND compaction itself has no crash-safe file operations — concurrent access can corrupt live state silently, and a crash during this unsynchronized compaction produces irrecoverable data loss.
A conflict in applyremotechange is only detected when localorigin != remotenode; two writes from the same origin at different timestamps are treated as sequential updates and resolved by tuple comparison without generating a ConflictRecord.
Conflict resolution is implemented in two disconnected modules: CRDTs encode resolution in their merge semantics (self-resolving), while the strategy enum covers only LWW and custom-merge — leaving no unified API and no bridge between the two approaches.
Gossip-based failure detection and Raft consensus interact in a way that can compound partition hazards: gossip's timeout-driven liveness set determines cluster membership, while Raft's partition behavior means an isolated leader silently accepts uncommittable writes and its inflated term forces re-election upon rejoining — if gossip's failure detection misclassifies a partitioned-but-live leader, it may trigger membership changes that interact with Raft's already-disruptive partition recovery.
Raft and Total Order Broadcast represent contrasting approaches to Paxos optimization: Raft centralizes proposal authority in a single elected leader (one election per term, then Phase-2-only replication), eliminating dueling proposals within a term, while TOB allows any node to propose for any slot using full two-phase Paxos each time, making it susceptible to competing proposals. These represent different points in the leader-based versus leaderless tradeoff space.
Consensus protocol correctness is doubly unverified: both Raft and TOB represent untested inverse optimizations of Multi-Paxos whose safety properties diverge specifically under asynchrony and crash failures, AND the testing methodology covers neither crash nor asynchronous failure modes — the exact conditions under which the two optimization strategies would reveal different safety profiles.
Both consensus mechanisms represent inverse optimizations of Multi-Paxos (Raft centralizes proposal authority in a leader, TOB decentralizes it to allow any node to propose), but neither variant's safety has been validated under the asynchronous conditions it is designed for — all protocol tests use deterministic synchronous delivery with no real network I/O, message loss, or reordering.
Adding a node that already exists silently returns an empty transfer map and does not change the ring's node count, vnode positions, or key assignments.
ConsistentHashRing defaults to 150 virtual nodes per physical node, and exposes loadimbalance (maxload / avg_load) to measure distribution quality
getnodes(key)[0] always equals getnode(key) — the preference list's head is the primary replica.
Adding an Nth node to the ring moves approximately 1/N of keys, not a full reshuffle — the defining property that makes consistent hashing useful for dynamic cluster membership.
getnodes() raises ValueError rather than silently under-replicating when replicationfactor exceeds the number of physical nodes on the ring.
Key lookup via get_node is O(log V) where V is total virtual nodes, using bisect on the sorted position list; node addition is O(V) per vnode due to list.insert.
ConsistentHashRing does not validate that numvnodes, weight, or replicationfactor are positive; zero or negative values silently produce degenerate ring states (e.g., a node with zero vnodes exists in _nodes but owns no ring arc).
ConsistentHashRing has no synchronization; concurrent addnode/removenode calls corrupt the sorted ringpositions and ringnodes lists.
ringpositions is maintained in sorted order at all times via bisect insertion; ringnodes is kept parallel to it with len(ringpositions) == len(ringnodes) always holding.
The ring's hash space is [0, 2^32), as validated by testringpositionvalidrange.
All consistent hash ring positions are computed via MD5 truncated to 32 bits; the hash function is not pluggable
The ring hashes keys and vnode identifiers using hashlib.md5 truncated to 32 bits via & 0xFFFFFFFF, chosen for distribution quality rather than security.
The weight parameter on add_node scales virtual node count proportionally, supporting heterogeneous nodes with different capacities
A new consumer joining a group loads committed offsets from the broker (tested at testpartitionedlog.py:152), enabling offset-based resume but also duplicate redelivery on crash
Each DerivedSystem tracks its own LSN position independently via _position, allowing consumers to fall behind or catch up at different rates without blocking each other
Passing a vector clock as context to VersionedKVStore.put declares the write descends from that version; omitting context treats the write as concurrent with all existing versions
Membership and data convergence operate at fundamentally different rates: gossip-based membership changes propagate in O(log N) rounds via epidemic-style random peer selection, but data convergence in ring topology requires O(N) sync rounds because each round advances changes by exactly one hop via store-and-forward requeuing — creating a window proportional to cluster size where the membership view is current but data remains stale.
The gap between expected and actual system correctness widens under failure: distributed protocols require storage-layer guarantees (crash-safe compaction, atomic writes, complete CRC coverage) that no implementation provides, and the resulting divergence accumulates without bound because anti-entropy can detect but not fully reconcile the inconsistencies.
System correctness is both unachievable and unfalsifiable: the gap between specification and implementation widens under every failure mode (distributed protocols require unmet storage guarantees, replica divergence accumulates without bound, storage degrades monotonically), and the testing methodology cannot detect these gaps (crash paths untested, protocols validated only under synchronous simulation) — the system cannot be correct and cannot discover that it is not.
Corruption is a terminal condition across every binary record reader in the codebase: each stops at the first CRC failure with no resync capability, and the WAL specifically halts all replay rather than skipping the corrupt record, meaning any single corrupt byte truncates the recoverable history.
Corruption propagates silently through every data transformation pipeline: both Bitcask compaction implementations copy records without CRC validation, and hint file generation reads source records without integrity checks — every transformation step amplifies corruption rather than filtering it.
In scansegment, a CRC mismatch or short read stops scanning the entire segment; records after the corrupt point are silently lost from the index even if they are individually valid.
Count stage materializes its entire input into a defaultdict before yielding any output, making it a pipeline barrier that forces all upstream stages to complete first
At designed capacity (n ≤ expected_items), the expected number of saturated 4-bit counters is effectively zero because the per-position load follows Poisson(ln 2 ≈ 0.693), making P(count ≥ 15) ≈ 10⁻¹⁵
CountingBloomFilter uses reference-counting semantics: an item added N times requires N remove() calls before it tests as absent via _contains_
At 10× overload (10× expected_items inserted), roughly 0.5% of CountingBloomFilter counters saturate; at 20× overload, approximately half saturate, effectively degrading the filter to a non-counting Bloom filter
Using CountingBloomFilter.remove() during compaction would require merge_sstables to report which keys were discarded, which the current implementation does not do — duplicates and tombstones are silently skipped in the merge loop
CountingBloomFilter uses saturating 4-bit counters so entries can be removed via remove(), which standard BloomFilter cannot do — relevant for maintaining filters over mutable data structures
The counting Bloom filter pays 8x memory overhead and introduces false negatives through its removal operation for a capability (element deletion) that the primary use case — immutable SSTables — never needs, since SSTables are write-once and discarded whole during compaction.
CountingBloomFilter.remove() is architecturally suited for maintaining filters over mutable in-memory structures (memtables) where keys can be overwritten without rescanning, not for SSTable compaction where the entire file is rewritten
CountingBloomFilter allocates one full byte per counter position (bytearray(self.m) at line 112) despite counterbits defaulting to 4, using 8x the space of a standard BloomFilter's bit array rather than the expected 4x.
Crash and failure recovery paths are systematically excluded from the test suite: the WAL has no tests for truncated records or CRC mismatches, LSM crash testing covers only WAL replay and ignores compaction crashes entirely, and SSI write-skew tests exist only in standalone tester files outside the default pytest runner — the most critical correctness scenarios have the least test coverage.
Crash recovery is simultaneously broken and unverified: no storage engine has a safe crash recovery path (non-atomic compaction, batch-blind replay, metadata-excluding CRC), and crash/failure paths are systematically excluded from the test suite — recovery bugs will persist indefinitely because neither the broken mechanisms nor their absence is tested.
CRC32 checksums in the WAL and Bitcask detect partial writes after a crash but cannot recover the lost data — they convert silent corruption into detected data loss, not into a recoverable state
The & 0xFFFFFFFF mask applied after zlib.crc32 in all modules is a Python 2 compatibility artifact; Python 3's zlib.crc32 already returns an unsigned 32-bit integer, making the mask harmless but unnecessary.
The CRC32 integrity checks in log-structured-hash-table/bitcask.py detect corrupt records within a segment but provide no protection against the cross-file atomicity problem during compaction.
Switching from ISO 3309 (zlib.crc32) to Castagnoli (CRC-32C) is a backward-incompatible wire format change; existing data files would fail CRC verification without a format version mechanism in the file header.
CRC32 is computed per individual record or page (typically hundreds of bytes to a few KB), not per file; file-level corruption is detected only indirectly by failing to decode a record, not by any file-wide checksum
All four CRDT types implement _eq to compare semantic state (not object identity), enabling CRDTReplicaGroup.allconverged() and the semilattice property tests to verify convergence via equality checks.
All four CRDT merge methods (GCounter, PNCounter, LWWRegister, ORSet) are idempotent: merging the same state twice produces the same result as merging once
All four CRDT types demonstrate idempotent merge (re-merging produces no change), semantic equality comparison, and monotonic ORSet tombstones. The sync_all test confirms convergence after two rounds, consistent with merge being commutative and associative, though these properties are exercised by tests rather than proven by the antecedents alone.
merge() mutates and returns self (enabling chaining like deepcopy(a).merge(b)) rather than returning a new instance, which means callers must deepcopy before merging if they need to preserve the original state.
CRDT mutation APIs lack defensive programming patterns at the boundaries: merge mutates the receiver in place and returns self (requiring callers to defensively copy before merging to prevent aliasing), and remove operations on absent elements silently return False rather than raising errors, making accidental no-ops invisible.
Each CRDT type in crdts.py (GCounter, PNCounter, LWWRegister, ORSet) encodes conflict resolution in its own merge() method, requiring no external strategy enum or callback
All four CRDT types (GCounter, PNCounter, LWWRegister, ORSet) use state-based replication via a merge() method that implements a join-semilattice; no operation log or causal delivery is used
Checkpoints, batch commits, and segment rotations all bypass the configured sync mode by calling fsync unconditionally, creating a two-tier durability model where structural WAL operations are always durable even when individual record writes are not.
Data distribution and processing both hit fundamental scalability ceilings: neither partitioning strategy provides both reliable routing and ordered access (hash destroys order, range depends on unverified parallel-array invariants), while query and aggregation operations hide unbounded memory barriers behind uniform tuple interfaces, meaning neither the data layout nor the query path scales gracefully under load.
scandatafile in the hash-index Bitcask reads each record's header and key but seeks past value bytes; it is still O(datasize) because headers must be parsed sequentially to locate record boundaries
All 37 modules configure behavior entirely through constructor parameters (e.g., maxfilesize, syncmode, pagesize) — there are no config files, environment variables, or settings modules.
Each top-level directory is a self-contained module with no cross-module imports; implementations are intentionally standalone so each concept can be understood in isolation.
All implementations are Python; the storage engine modules use only stdlib except for sortedcontainers in the LSM tree — no frameworks or production infrastructure.
The testertest*.py files are designed as standalone scripts that print "testname PASSED" / "testname FAILED" to stdout and include if _name == "main_" blocks, enabling automated verification without pytest.
Most modules follow a consistent three-file pattern: one implementation file, one test file, and a testertest*.py file (likely a meta-test or test harness validator).
decodewithid does not validate the message format before parsing; any 4+ byte input will be interpreted as schema-ID-prefixed data, producing a confusing KeyError on invalid input rather than a clear format error
savesnapshot() uses copy.deepcopy on the projection state dict to create an independent copy, ensuring subsequent event processing during catchup() does not mutate the saved snapshot
System degradation is irreversible: the system degrades monotonically at every abstraction level with no self-healing (leaked pages, growing tree height, accumulating divergence), and no subsystem at any architectural tier has a viable recovery strategy to arrest the erosion — recovery infrastructure is either vestigial, paradoxically over-engineered, or fundamentally missing.
Both Bitcask implementations delete old segment files (os.remove) before renaming the compacted replacement, creating a crash window where neither old nor properly-named new data is on disk
Delete propagation is fragmented end-to-end: tombstone representations differ at every storage layer (ambiguous empty-bytes sentinel, premature compaction purging, replication-dependent lifetime), and derived systems require explicit flush plus old-value CDC events that inconsistent tombstones cannot reliably provide.
The same three-condition visibility rule (committed, not in activeatstart, lower txid) is applied independently to both the creating and deleting transaction of a version — is_visible runs the same logic in two phases.
Derived systems (secondary indexes, materialized views) require two independent conditions for consistency: explicit flush to make mutations visible, and CDC old-value capture to remove stale index entries — either condition failing silently produces stale reads.
Every DerivedSystem in the unbundled database tracks a position cursor indicating how far through the CDC log it has consumed, enabling independent catch-up and ensuring consumers can resume from where they left off
The derived-system pattern (secondary indexes, materialized views, projections) depends on event infrastructure that is unreliable at its foundation: event sourcing has no durable recovery mechanism and conflates two ID spaces, while derived systems require explicit flush calls and old-value CDC events for consistency — the consumer-side correctness requirements depend on producer-side guarantees that do not hold.
Each DerivedSystem tracks its own LSN position independently, allowing consumers to fall behind or catch up at different rates without coordinator state or blocking
Every DerivedSystem implements rebuild(events) that clears state and replays from scratch, guaranteeing eventual convergence with the CDC event log regardless of prior state
json.dumps(sort_keys=True) ensures all honest PBFT nodes produce identical SHA-256 digests for identical requests without inter-node coordination, transforming trust verification into a deterministic computation
The accepted_preprepare dict enforces a one-to-one mapping from (view, sequence) to digest, preventing a Byzantine primary from assigning two different requests to the same protocol slot
Backup replicas independently recompute the SHA-256 digest from the request payload in every PRE_PREPARE message; they never trust the primary's claimed digest value
loadexisting_sstables using os.listdir cannot distinguish pre-compaction from post-compaction file sets after a crash — if the crash occurs after creating the merged SSTable but before deleting the inputs, recovery sees duplicates; the inverse loses data.
The distributed cluster has no reliable foundational infrastructure: membership detection is unreliable in both accuracy (no adaptive thresholds) and propagation (asymmetric convergence rates between membership and data), and ordering infrastructure is broken at every layer (vestigial WAL sequence numbers, conflated event sourcing ID spaces, volatile CDC consumer positions), meaning neither who is in the cluster nor what order things happened can be answered reliably.
Distributed correctness is doubly unachievable under network partitions: protocols require storage-layer guarantees (crash-safe compaction, CRC-protected metadata) that aren't met, AND partitions amplify the resulting gaps through disrupted gossip-based failure detection and stale leader writes — the prerequisites for correctness are absent even before partitions introduce additional failure modes.
Distributed system correctness is undermined at both the storage and protocol layers: storage engines silently assume single-threaded access (no locks, no assertions, no documentation), while the quorum protocol weakens its own semantic guarantees (counting hint storage as successful writes, allowing sub-quorum configurations without error).
Replica divergence accumulates without bound: write operations have compounding correctness gaps (sloppy quorums count hints, sub-quorum configs accepted, conflict resolution split across modules), and the repair mechanism (Merkle-based anti-entropy) cannot fully reconcile because tombstone semantics differ at every layer.
Distributed protocols rest on doubly invalid foundations: end-to-end correctness requires storage-layer guarantees (crash-safe compaction, CRC-protected metadata) that no storage engine provides, and protocol safety claims are unfalsifiable under the current testing methodology (synchronous simulation, no crash path tests) — the protocols assume both correct storage and correct testing, and have neither.
All distributed protocol implementations use synchronous message delivery: PBFT runs the full three-phase protocol in a single deterministic call, bully elections resolve cascading responses within one tick, and Lamport clocks deliver messages in the same call stack — none model the network asynchrony that is the core difficulty of distributed systems.
Any tombstone garbage collection strategy must define a maximum tolerated node downtime; tombstones removed before a down node receives them cause data resurrection via read repair or anti-entropy.
In the multi-leader replication module, tombstones cannot be safely removed until all replicas have received the delete, adding a replication-convergence constraint beyond the local compaction-coverage constraint
Distributed write correctness has compounding weaknesses: quorum semantics are weakened by sloppy quorum and permissive configuration, and the conflict resolution that should catch remaining inconsistencies is split between two disconnected mechanisms (self-resolving CRDTs and an incomplete strategy enum).
dosync(force=True) always calls fsync regardless of the configured sync_mode, ensuring durability-critical operations like segment rotation and WAL close are never silently skipped
DocumentPartitionedDB.querybyfield always iterates every partition regardless of result count, touching exactly num_partitions partitions (scatter/gather with no short-circuit).
DocumentPartitionedDB.put() always touches exactly 1 partition regardless of how many fields are indexed, because the secondary index is co-located with the document on its home partition.
Durability bugs are permanently invisible: the write-to-verify durability pipeline is broken at both ends (fsync policy gaps prevent data from reaching stable storage, and incomplete integrity checks cannot verify it arrived), while crash/failure recovery paths are systematically excluded from testing — the system cannot detect its own durability failures through any available mechanism.
The write-to-verify durability pipeline is broken at both ends: fsync policy inconsistencies across critical paths mean data may never durably reach disk (B-tree skips fsync for structural metadata while paying double for data), and incomplete CRC coverage means corrupted data that does reach disk passes integrity checks undetected (payload-only CRC leaves routing metadata unprotected).
DynamoCluster.put only stores hints after the write quorum is already met on real nodes, unlike HintedHandoffStore which counts hints toward the quorum — making DynamoCluster a strict-quorum-with-opportunistic-hints design, not a true sloppy quorum
When multiple replicas hold different values at the same max version number, ReadResult.is_conflict is True and value is a list of all conflicting values rather than a single resolved value
When put() fails to meet write quorum, it decrements the version counter to prevent version gaps, then raises QuorumNotMet
Hinted handoffs for all unavailable nodes are stored on available_nodes[0] only, creating a single point of responsibility for hint delivery and a potential hotspot
The leaderless replication implementation (dynamo.py) has no delete or tombstone mechanism; adding deletes without tombstones would cause resurrection via read repair or anti-entropy
Version counters in DynamoCluster.versioncounters are scoped per key, not global across the cluster; independent keys maintain independent version sequences
DynamoCluster.get() repairs all available nodes after every read, not just the quorum participants, trading higher per-read write fan-out for faster convergence compared to ReadRepairStore.
DynamoCluster.get() repairs all stale replicas reachable during the read, not just enough to satisfy the quorum — every replica with a version below the max is updated
Hinted handoff only occurs when DynamoCluster is constructed with sloppyquorum=True; without it, deliverhints() returns 0 and offline nodes receive no data until anti-entropy runs
Version assignment uses a single coordinator-level counter per key (versioncounters), not per-replica counters, which prevents version conflicts but requires a centralized coordinator
DynamoCluster.put() sends writes to every available node, not just W of them; the quorum check gates on acknowledgment count, not send count, maximizing replica consistency
A predicate that matches zero keys still appends a predicate lock to tx.predicatelocks, enabling phantom detection if a concurrent transaction later inserts a key that would match
SchemaRegistry.encodewithid uses a raw 4-byte big-endian schema ID prefix with no magic byte, unlike Confluent's 5-byte header (0x00 + 4-byte ID), so there is no format versioning for future wire format changes
End-to-end distributed correctness is unachievable: protocol-layer weaknesses (sloppy quorums, single-threaded assumptions) depend on storage-layer guarantees (atomic recovery, batch integrity, metadata checksums) that no implementation provides.
The entry count stored in the SSTable header by SSTableWriter.finish() is trusted by SSTableReader but never validated against the actual number of entries read from the data section
In EQUIVOCATING mode, each broadcast message is replaced by N-1 targeted messages with per-peer distinct digests computed via compute_digest, meaning the output list can be larger than the input
Event IDs are sequential integers per-stream starting at 1, not globally unique identifiers; global_position is the separate cross-stream counter.
global_position increments monotonically across all streams and equals the total number of events ever appended to the store.
LiveProjection automatically applies new events on each append() after the initial catchup() call — no subsequent explicit catchup() required.
append(..., expected_version=V) succeeds only if the current stream version equals V; a stale version raises ConcurrencyConflict.
EventStore disk persistence uses JSONL (newline-delimited JSON) format, loaded in full on construction to rebuild the in-memory state.
Projection.catch_up() only processes events after its current position, returning the count of newly processed events; a second call with no new events processes zero.
Projection.savesnapshot() persists both the projection state and its current position; loadsnapshot() restores both so that catch_up() resumes from the snapshot point.
Event sourcing snapshots on store.snapshots are plain Python dicts monkey-patched onto the store instance with no serialization; getattr(self.store, '_snapshots', {}) defensively handles their absence
Event sourcing snapshots are triply limited: stored as ephemeral in-memory dicts (lost on restart), excluding subscriber callbacks (requiring re-registration), and ignored by reconstruct_state (which always replays from the beginning).
reconstructstate(handlers, events, upto=N) replays exactly the first N events to rebuild past state at that point in time.
test_verify.py runs as a standalone Python script with inline assertions and a final print statement, not through a test framework like pytest
reaches assumes the event graph is a DAG with no cycle detection; this is safe because Node methods only ever link to previously-created events via parent and _cause.
The event causality graph maintains correctness through two complementary invariants: acyclicity by construction (preventing infinite traversal in reachability checks) and identity-based comparison (preventing conflation of structurally-equal but distinct events).
eventid is a global sequence number assigned as len(self.events) + 1, starting at 1 and incrementing by 1 for each appended event with no gaps within a process lifetime
Derived systems depend on event infrastructure that is independently unreliable in both content and ordering: events may be lost, duplicated, or mis-addressed due to unreliable persistence and addressing, AND their ordering is broken at every layer — WAL sequence numbers are vestigial, event IDs conflate two spaces, and CDC consumer positions are volatile.
Both Projection.catchup and DerivedSystem.processevent track a position cursor, pull/receive ordered events, and apply type-dispatched handlers — the same structural pattern solving "derived data" from different starting points
Event sourcing read operations conflate two ID spaces: projections assume contiguous stream-scoped IDs for catch-up arithmetic, while state reconstruction filters on global event_id, and sequence numbers survive compaction without renumbering — creating potential mismatches when events are removed or compacted.
Event sourcing has no durable recovery mechanism: appends update in-memory state before persisting to disk with no rollback on failure, snapshots are ephemeral in-memory dicts lost on restart, and state reconstruction must always replay from the beginning with no incremental checkpoint support.
Event sourcing is fundamentally unreliable at two independent levels: it has no durable recovery mechanism (in-memory state updated before disk, no crash-consistent append, ephemeral snapshots), and its core addressing abstraction conflates stream-scoped and global ID spaces across read paths, meaning even successfully persisted events may be incorrectly reconstructed.
Event store appends update in-memory state before persisting to disk with no rollback on failure, batch appends can leave partial writes, and the per-call file open/close pattern adds overhead without providing atomicity guarantees.
EventStore.loadfromfile replays every NDJSON line on startup with no mechanism to detect or discard incomplete batches left by a mid-batch crash, unlike the WAL which uses OPCOMMIT records as transaction boundaries.
Event store appends events to the in-memory events list before calling persist_event, with no rollback on write failure — a crash or I/O error leaves the in-memory state ahead of disk with no mechanism to reconcile.
The event store uses newline-delimited JSON where per-line resync is architecturally possible, but loadfrom_file halts on the first json.JSONDecodeError rather than skipping the bad line
event-sourcing-store/eventstore.py:persist_event relies solely on Python's context manager close() for implicit flush() with no explicit flush() or fsync(), providing no durability guarantee against either process crash or power loss.
EventStore.append() accepts an expected_version parameter and rejects appends when the stream's current version doesn't match, implementing optimistic concurrency control for event streams
EventStore.append takes an expected_version parameter for optimistic concurrency on stream appends, rejecting writes when the stream has advanced past the caller's version
EventStore.persistevent() in event-sourcing-store/event_store.py opens the file in a with block, writes JSON, and relies on implicit close() — no explicit flush() or os.fsync(), making persisted events vulnerable to loss on OS crash
EventStore.persistevent opens the NDJSON file in append mode, writes one JSON line, and closes it without calling flush() or os.fsync(); written data may remain in OS buffers or Python buffers at crash time.
EventStore in event-sourcing-store/eventstore.py is the repo's existing model for optional disk persistence: accept persistpath in constructor, call loadfromfile on init if the file exists, call persist_event on every mutation
EventStore.persistevent performs a separate open()/write()/close() cycle for each individual event, so append_batch with N events results in N independent file operations rather than a single buffered write.
EventStore assumes single-threaded access: eventid = len(self.events) + 1 is a race condition under concurrency, and no locking protects events, streams, or subscriber notification.
The event store's expectedversion check-then-act pattern in append and appendbatch has no locking, making the optimistic concurrency guard safe only under a single-writer concurrency model — concurrent writers create a TOCTOU race
expireevents uses watermark - duration as a one-sided cutoff for buffer cleanup, while contains() is symmetric; this is correct because future events can only arrive at or after the watermark
Sort with memory_limit produces the same output ordering as an in-memory sort — the external merge-sort is a pure implementation optimization with no semantic difference.
No implementation uses os.fdatasync(), missing a safe optimization for append-only WALs where only data (not metadata like mtime) needs to reach disk.
Python's os.fdatasync exists only on Linux, not macOS/Darwin; any switch from os.fsync requires a platform fallback via getattr(os, 'fdatasync', os.fsync) to work on the current development environment
Every file opened in append mode ("ab") in this codebase writes sequentially with monotonically increasing file size, making fdatasync a safe drop-in replacement for fsync on those paths (WAL, Bitcask data files, B-tree WAL)
FencedResourceServer.write rejects writes where fencing_token < highest (strict less-than, not less-than-or-equal), meaning same-token retries succeed — enabling idempotent writes with the same lock acquisition
All denial conditions in fencing tokens (lock held, stale token, wrong client) return sentinel values (None, False, or error dicts) rather than raising exceptions, modeling distributed system responses
FencedResourceServer.write() rejects any write with a fencing token strictly less than the highest token previously seen for that resource; equal tokens are accepted
The fencing token counter is global across all locks, not per-lock; tokens issued for different locks are comparable and strictly ordered by acquisition time
LockService._counter starts at 1, increments by 1 on every successful acquire(), and is never decremented or reset, guaranteeing strict monotonicity of issued tokens
Token high-water marks on FencedResourceServer are scoped per resource identifier, so a lower token can succeed on a different resource than the one that saw a higher token
The FencedResourceServer has no TTL or expiration logic for tokens — once a token number is seen, all lower tokens are permanently rejected for that resource
find_conflicts returns True if and only if at least two VersionedValue entries in the input list have concurrent (mutually non-dominating) vector clocks
findleaf, search, and insert all use bisectright (not bisectleft) for child-pointer routing, ensuring all three methods agree on which leaf owns a given key; this is consistent with the split strategy where the separator key equals the first key of the right sibling.
findleaf does not verify that a page is an internal node before calling deserializeinternal; a corrupted tree height or misidentified page type produces silently wrong results rather than an error.
findleaf is called exclusively by rangescan; point lookups use search instead, which reads the leaf inline and returns the value rather than the page number.
When two concurrent transactions write the same key, the first to call commit() succeeds and the second is aborted automatically
Only 3 of the test*.py files use @pytest.fixture (testcdc, testmapsidejoins, testbitcask); most pytest files still use inline setup identical to their tester counterparts
Every os.fsync() call in the codebase is immediately preceded by a .flush() call, ensuring Python's userspace buffer is drained to the kernel page cache before requesting stable storage persistence.
flush() both clears self.memtable and appends to self._sstables, creating a window where data exists in neither location if a concurrent reader checks between the two mutations
Each flush call creates an SSTable file with a strictly increasing sequence number (zero-padded filename for lexicographic = numeric sort), preserving newest-last ordering in self.sstables
flushindex() is the sole mechanism that materializes async index updates: it iterates self.pending, calls applyindex_op() for each entry, and returns the count of operations applied
flush writes the frozen memtable directly to an SSTable without staging it in immutablememtables, creating a brief window where in-flight keys are invisible to get() despite immutable_memtables being checked in the read path
Passing force=True to dosync() causes flush+fsync regardless of the configured sync mode, enabling group commit at batch boundaries
The only dosync(force=True) call sites (lines 165 and 175) correspond to segment rotation and WAL close, both operations where proceeding without a flush would risk data loss or file corruption
The system can neither self-heal during operation nor be evolved to add self-healing capabilities: runtime degradation is permanent (leaked pages accumulate, tree height only grows, no rebalancing occurs, crash recovery has no safe path), AND the rigid binary formats across the entire storage stack prevent adding recovery mechanisms such as resync points, version negotiation, or structural checksums.
The system cannot evolve its way out of known corruption vulnerabilities: the rigid binary format design across the entire storage stack prevents forward evolution and post-corruption recovery (no block alignment, no version fields, no extensibility), and the system simultaneously lacks defense-in-depth against the corruption these formats cannot recover from (no input validation, no resync capability).
The free list head pointer is stored as the fifth field of the metadata page (page 0) in the layout [rootpage:4B][height:4B][totalkeys:4B][nextfreepage:4B][freelisthead:4B], making it durable across restarts.
The page free list is LIFO: freepage pushes to the head and allocatepage pops from the head, so the most recently freed page is reused first.
The B-tree's free page list is a LIFO stack: freepage pushes onto the head and allocatepage pops from the head, so pages are reused in reverse order of when they were freed.
allocatepage and freepage share an implicit contract on free-list node layout (3-byte zeroed header + 4-byte big-endian next-pointer); neither validates the other's output, so a format change in one silently breaks the other
LIFO free-list reuse means delete-then-reinsert cycles progressively decorrelate logical key order from physical page order, defeating sequential I/O readahead for range scans.
PageManager.freepage is only invoked by BTree.delete when removing an emptied leaf node; no other code path frees pages.
PageManager.free_page() overwrites the freed page's content with a zero header and a free-list pointer, destroying the original leaf/internal node data so freed pages cannot be accidentally read as valid tree nodes.
Each free-list page occupies a full page (default 4096 bytes) but stores only 7 bytes of useful data (3-byte zero header + 4-byte next pointer); the remaining bytes are zero padding from write_page.
SSTableWriter.finish() hardcodes level=0 (sstable.py:104), matching LevelDB's invariant that flushed memtables always produce L0 files
The codebase has a systematic fsync policy inconsistency between components: WAL critical operations (checkpoints, batch commits, segment rotations) always force-fsync regardless of configured mode, but B-tree metadata mutations (which are equally critical for crash recovery) skip fsync entirely despite paying double fsync for user data pages.
os.fsync appears in the WAL append path and Bitcask record writes but never in SSTable creation, compaction output, or hint file writes — durability is applied to the append hot path only
The complete data lifecycle is unsafe from storage maintenance through data retrieval: compaction is the highest-risk operation that can permanently lose or resurrect deleted data, and the read path from SSTable through CDC to derived systems is unreliable at every stage, meaning data is at risk whether it is being reorganized for efficiency or being served to consumers.
Both the physical storage layer and the logical transaction layer are independently fragile under restart, creating a full-stack restart hazard: the B-tree's durability model protects user data pages but not structural metadata (height, sibling chain, free list), while the transaction system's isolation model depends on monotonic counters and abort-as-status-change semantics that have no persistence backing.
When no transactions are active, garbage_collect() retains at most one version per key — the latest committed non-deleted version — and drops everything else including fully-deleted keys.
garbagecollect() mutates versions (replacing lists and deleting keys) without synchronization, assuming single-threaded execution — concurrent reads or writes during GC would race on the dict and its lists.
Garbage collection removes only versions unreachable by any active transaction, ensuring long-running read-only transactions see consistent data
garbagecollect() removes all versions with createdby in _aborted as its first step, regardless of active transaction state — aborted versions are invisible to everyone and always safe to drop.
With active transactions, garbagecollect() uses min(active txids) as the GC horizon; only versions superseded by a committed version with txid < mintx_id are eligible for removal.
getpendingchanges returns the current _pending list and replaces it with an empty list, ensuring each change is delivered exactly once per replication cycle
Each gossip round transmits the entire membership list per exchange despite a fanout of only one peer per node, making per-message payload O(N) in cluster size and total per-round bandwidth O(N squared) across the cluster, even though only N point-to-point messages are sent.
GossipNode.receive_gossip merges heartbeat counters and node status (alive/suspected/dead) only; it never exchanges application key-value data, making it a SWIM-style failure detector rather than a data replication protocol
Dead nodes are removed from the membership list after t_cleanup elapsed time (default 20), preventing unbounded growth of the membership table from accumulated failure records
After t_cleanup elapses, a dead node's record is fully deleted from membership (not just flagged), which enables clean rejoin with a fresh identity
The gossip simulation uses explicit logical timestamps passed as parameters (not wall-clock time), with a fixed RNG seed for deterministic gossip partner selection, following the same simulated-time pattern as hinted-handoff
Full membership convergence occurs within O(log N) gossip rounds, empirically bounded by 5 * log₂(N) + 5 rounds in the test suite
Dead nodes in the gossip protocol follow a comprehensive, irreversible lifecycle with three independent safeguards: death status cannot be reversed by incoming gossip messages, dead node records are fully removed from the membership list after the cleanup interval (not merely flagged), and incoming gossip about already-dead nodes from other peers is silently filtered to prevent zombie reintroduction through stale state.
When receiving gossip, unknown nodes that arrive with dead status are silently dropped to prevent zombie membership entries from propagating through the cluster
Once a node's status is dead, receiving a gossip message with status: alive for that node will never revert it — the merge logic at line 67 checks local["status"] != "dead" before allowing exoneration
All inter-node data transfer (sendgossip, join, getmembership_list) uses copy.deepcopy to prevent shared mutable state between simulated nodes
detectfailures makes decisions using only currenttime - timestamplastupdated; it consults no historical distribution or sliding window, making it a pure point-in-time comparison rather than a statistical inference.
GossipNode.getalivemembers returns the liveness set that anti-entropy, read-repair, and hinted-handoff all depend on to decide which nodes to contact — gossip provides the membership layer that every data-exchange layer needs
Gossip-based failure detection is the single correctness bottleneck for the distributed cluster: replication, read repair, and hinted handoff all depend on its timeout-driven liveness set, which is bounded by cleanup to prevent unbounded membership growth.
Gossip failure detection is both the single correctness bottleneck for the cluster AND permanently miscalibrated: it governs all replication, read repair, and hinted handoff decisions, yet stores only the most recent timestamp per peer with no arrival history, making it unable to adapt detection thresholds to actual network conditions as phi-accrual detection would require.
Each node selects exactly one random peer per gossip round, with bidirectional exchange producing at most N pairwise syncs per round.
Failure detection uses exactly three fixed time thresholds (tsuspect=5, tdead=10, t_cleanup=20) producing a deterministic state machine with no runtime adaptation to network conditions.
receive_gossip only accepts remote state when the remote heartbeat counter strictly exceeds the local counter, except for death notifications at equal counters which are also accepted
Each node stores only the most recent timestamplastupdated per peer, discarding all inter-arrival time history that would be needed for probabilistic failure detection (e.g., phi accrual).
Node status follows alive → suspected → dead → removed with configurable timeouts tsuspect, tdead, t_cleanup governing transitions; a suspected node can return to alive if a higher heartbeat counter arrives
Suspicion in the gossip protocol is triggered by elapsed time exceeding t_suspect since the last heartbeat update, not by failed probe responses as in the full SWIM paper's ping/indirect-ping protocol
GossipCluster uses unrestricted random peer selection (any node can gossip with any other), making the effective topology a fully-connected graph.
All nodes in a GossipCluster receive identical threshold values from the cluster constructor at _init_ (line 117); individual nodes cannot calibrate sensitivity to their specific network path.
Each gossip round transmits the entire membership list via deepcopy rather than deltas or piggybacked updates, making per-exchange bandwidth cost O(N) in cluster size instead of O(1) amortized
A leaving node sends its death status to every active peer (not just a random one) before deactivating, unlike normal gossip which uses random pairwise exchange
Both reconstruct_state and Projection expect handler functions to mutate the state dict in place rather than returning a new value; a handler that returns without mutating silently loses its changes
happensbefore calls reaches in both directions to distinguish "a before b", "b before a", and "concurrent" — a single call can only confirm or deny one direction.
Both Bitcask implementations keep every live key in an in-memory dict (keydir / _index), meaning the key set must fit in RAM; there is no disk-based fallback for partial index spill.
The hash-index-storage Bitcask records contain no CRC or checksum field, unlike the log-structured-hash-table implementation; on-disk corruption is undetectable during reads or compaction.
hash-index-storage/bitcask.py uses a single cached file handle per segment for all reads via getreader(), making concurrent reads to the same segment unsafe due to shared seek position.
The hash-index-storage/bitcask.py compact() must be called explicitly by the caller; there is no auto-compact trigger, threshold tracking, or background compaction unlike the log-structured-hash-table variant
hash-index-storage/bitcask.py's get() checks if value == "": return None, providing a defense-in-depth layer against tombstone leaks that log-structured-hash-table's get() lacks
The hash-index-storage hint file contains all four keydir fields (file_id, offset, size, timestamp), making it sufficient to fully reconstruct the in-memory index without reading data files
Hash index storage is fundamentally memory-bound: every key must reside in RAM for the single-seek O(1) read path, making dataset size directly constrained by available memory with no spill-to-disk fallback.
Both Bitcask implementations require every live key to be held in a Python dict in memory; the dataset's key space is bounded by available RAM
hash-index-storage/bitcask.py intentionally omits CRC to focus on keydir/append-log/compaction; log-structured-hash-table/bitcask.py provides the integrity-checking variant with zlib.crc32 and CorruptionError
Neither hash index implementation (hash-index-storage/bitcask.py or log-structured-hash-table/bitcask.py) provides a scan, range, or ordered-iteration method; only exact-key point lookups are supported.
A Bitcask get() does one dict lookup plus one positioned disk read (O(1)), while an LSM-tree get() may search the memtable then multiple SSTables from newest to oldest (O(log N) per level)
In hash-index-storage Bitcask, the syncwrites flag controls whether each write_record call fsyncs to disk; when False, recently written records may be lost on crash.
writerecord timestamps use time.time(), which is not monotonic — NTP adjustments or manual clock changes can cause a newer write to carry an older timestamp, confusing compaction's latest-wins conflict resolution.
writerecord in hash-index-storage/bitcask.py does not modify keydir; the caller (put or delete) is responsible for updating the in-memory index after the write returns.
Hash-mod partitioning (hash(k) % num_reducers) scatters lexicographically adjacent keys across different partitions, making range queries require a full scatter-gather across all reducers; MapReduce run() re-sorts the final results to compensate
In mapreduce.py, numreducers is fixed at job creation (numreducers: int = 2) and never changes during execution; all partition assignments are determined by hash(k) % num_reducers with no dynamic rebalancing
Three modules use the same hash(k) % numpartitions pattern with fixed partition counts and no hot-key mitigation: mapreduce.py:109, secondaryindexpartitioning.py:56, and partitionedlog.py:149 — all share the straggler vulnerability where a single hot key overwhelms one partition
The codebase consistently uses zlib.crc32 for corruption detection in storage engines and hashlib (SHA-256) for content addressing (Merkle trees), uniform distribution (consistent hashing, bloom filters), and cryptographic integrity (BFT) — the two hash families never cross purposes
A corrupted keysize or valsize in hash-index-storage/bitcask.py causes scandata_file to read wrong byte boundaries, potentially misinterpreting all subsequent records in that file
The merge heap tuple includes a source_index field solely to prevent Python from comparing SSTableEntry objects when (key, -timestamp) ties, which would raise TypeError.
A corrupted hint file silently loads bad offsets into the index with no validation; a missing hint file triggers a CRC-checked full scan via scansegment, making hint corruption strictly worse than hint absence.
Hints in HintedHandoffStore have a TTL; if triggerhandoff doesn't run before createdat + hintttl, expirehints silently drops the hint, permanently losing that replica's copy of the data
Hint files are written without fsync; they are a best-effort startup optimization that can be lost on crash without data loss, since the data file remains authoritative.
Neither the hint file writer nor reader performs checksum validation; a truncated or corrupted hint file causes a struct.error or IndexError on startup rather than a graceful fallback to data file scanning.
In hash-index-storage/bitcask.py, a crash between deleting a data file and its hint file leaves an orphaned hint that directs reads to a nonexistent data file, causing hard failures on get().
Hint files are written directly to the final path (not via temp-file-and-rename), so a crash mid-generation leaves a partial hint file; loadhint_file handles this by stopping at the first short read
Hint files accelerate index rebuild during recovery but carry no crash-consistency guarantees; they are not consulted to determine which data files constitute the valid current state
A hint entry is written only when self._index[key] confirms the key's canonical location is the current segment and offset, filtering out entries superseded by later writes or compaction
Hash-index hint uses a 24-byte fixed header per entry (fileid:u32, offset:u64, size:u32, timestamp:f64) while log-structured hint uses only an 8-byte header (keysize:u32, offset:u32), omitting file_id, timestamp, and record size
writehintfile and loadhintfile must use identical binary layouts: [HINTFORMAT header (24 bytes)][keylen uint32 (4 bytes)][key_bytes (variable)] per entry; any change to one without the other breaks backward compatibility.
Each hint record is HINTHEADERSIZE (24 bytes) + 4 (keysize uint32) + keylength bytes; the HINT_FORMAT constant covers only the fixed 24-byte portion, not the full record.
createhintfiles does not verify CRC checksums on the segment records it reads (unlike scansegment), so corrupted data can be silently indexed via hint files
createhintfiles never generates a hint file for the active segment, only for frozen (immutable) segments, since a hint for the active segment would be immediately stale
The offset value stored in a hint entry is a byte position in the corresponding .dat segment file, not a position within the hint file itself; get() uses this offset to seek directly into the segment.
writehint_file is called exclusively from compact(), never during normal put/delete operations; hint files only exist for compacted (merged) data files.
removehintsfor() purges all hints for a recovered target node regardless of whether individual hints were successfully delivered or had expired; after trigger_handoff, no hints for that node remain on any other node.
If a hint node crashes and its hints are lost before handoff, the target node never receives the data and no error is raised — requiring anti-entropy repair (e.g., Merkle trees) to detect and fix.
Hint expiration uses exclusive comparison: a hint created at time t with TTL d survives at t + d - 1 but is expired at t + d.
Write failures are signaled via success: False in the return dict; the hinted handoff module never raises exceptions in normal operation.
Each unavailable preferred replica gets at most one hint stored on one non-preferred node per write operation, preventing hint explosion.
get() only queries preferred replicas, so data existing only as hints on non-preferred nodes is invisible to reads until handoff delivers it.
When sloppyquorum=True (the default), stored hints count toward writequorum the same as direct replica writes, trading consistency for write availability.
When sloppy quorum is enabled and preferred replicas are down, writes succeed by routing to non-preferred substitute nodes that store hints for later delivery.
With sloppyquorum=False, a write fails if fewer than writequorum preferred nodes are available, even if non-preferred nodes could serve.
All time-dependent operations (put, triggerhandoff, expireallhints) take currenttime as an explicit argument rather than calling time.time(), making the implementation fully deterministic and testable.
Hints expire after createdat + ttl (checked via Hint.isexpired); a node that stays down longer than the TTL window will not receive those hinted writes and must rely on anti-entropy for convergence.
The coordinator assigns strictly increasing versions per key via a centralized counter, so stale hint replays are absorbed harmlessly by Node.put()'s version >= existing_version check.
All three storage engines rebuild their in-memory index (keydir, _index) entirely from on-disk files at startup; inconsistent file state after a crash produces a silently incomplete index with no error
During initial file creation, PageManager._init writes metadata and the root leaf page directly via write_meta without WAL protection; acceptable because no user data exists yet to lose.
Input validation is systematically absent across configuration and API boundaries: the WAL accepts arbitrary strings as sync mode (silently disabling all durability guarantees), truncate accepts out-of-range sequence numbers (silently deleting all data), compaction strategy selection silently falls through to leveled on any unrecognized string, and Merkle proof direction is never validated — misconfiguration produces incorrect behavior rather than failing fast.
Data integrity checking fails in two complementary ways: CRC checksums exclude routing metadata (sequence numbers, page numbers, headers), so corruption in those fields goes undetected; and every reader halts at the first detected CRC failure with no resync capability — meaning both detectable and undetectable corruption are terminal.
Data integrity verification degrades monotonically along the storage pipeline: WAL records have CRC checksums that exclude routing metadata (partial coverage), SSTables have no checksums at all (zero coverage), and neither layer can recover from detected corruption — data transitions from partially verified to completely unverified as it moves through compaction from WAL to SSTable.
serializeinternal assumes len(children) == len(keys) + 1 but does not assert it; too few children raises IndexError, while extra children beyond keys+1 are silently dropped — a potential data-loss path if the invariant is violated.
Internal nodes have no sibling pointer field; their binary layout is `type:1B][num_keys:2B][child₀:4B... with no right-link, and deserializeinternal returns a 2-tuple (keys, children) vs. the leaf's 3-tuple (keys, values, next_sibling)`.
Internal page wire format is [type:1B][num_keys:2B][child0:4B] followed by repeating [keylen:2B][key:varB][child:4B], all big-endian with no alignment padding; this interleaved layout mirrors the logical separator-pointer structure.
_iter reaches the leftmost leaf by following children[0] at every internal level, then walks the sibling chain to NOSIBLING to enumerate all keys in sorted order
Unlike get, put, delete, and rangescan, BTree.iter does not call resetcounters() before running, so I/O page-read stats accumulate on top of any prior count — a gotcha when benchmarking iteration.
Before reading, iterate() calls fd.flush() but not os.fsync(), so buffered writes reach the OS page cache but are not guaranteed durable on stable storage.
iterate() releases the WAL lock after flushing but before reading begins, so concurrent appends may or may not be visible to the iterator depending on timing.
When iterate() encounters a CRC mismatch, iteration stops with no exception or sentinel value — the caller receives all valid records before the corruption with no indication that more data existed.
iterate() yields records of every op type (PUT, DELETE, COMMIT, CHECKPOINT) while replay() filters to only PUT and DELETE records.
The stream join processor's output is fully determined by the sequence of (streamname, key, value, timestamp) inputs and advancetime calls, with no dependency on wall-clock time — making it deterministically testable without clock mocking
TimeWindow.contains() uses abs(t1 - t2) <= duration, making join matching a symmetric interval check centered on each event, not aligned to any fixed time grid
The join TimeWindow duration controls which event pairs can match, while the TumblingWindowAggregator window size controls result grouping; these are independent parameters that can differ without violating any invariant
KeyRangeMerkleTree.diff_keys() returns the string key names of divergent entries, translating numeric leaf indices back to keys
KeyRangeMerkleTree sorts input pairs by key before building, so two trees with identical key-value content always produce the same root hash regardless of insertion order
happens_before(a, b, events) returns True (a causes b), False (b causes a), or None (concurrent) — modeling a strict partial order with explicit concurrency detection.
happensbefore() determines causality by BFS over parent/cause edges in the event DAG, not by comparing timestamps; the allevents parameter is accepted but never used.
LamportMutex grants critical section access to the requesting node with the lowest timestamp, with other requesters waiting until the holder releases.
canenter() returns True only when the node's request is at the queue head (lowest (timestamp, nodeid)) AND acknowledgments have been received from every other node in the system.
receive_tick computes max(local, received) + 1, ensuring every receive event has a timestamp strictly greater than the corresponding send event, which is the core Lamport clock invariant.
LamportClock.receivetick(remotets) computes max(local, remote_ts) + 1, ensuring the clock advances past both the local and remote state on every receive.
Node.send_message(target, payload) has a side effect on the target node: it appends a RECEIVE event to the target's event log and advances the target's clock.
sendmessage calls tonode.receive_message() directly in the same call stack, making message delivery instantaneous and deterministic with no async queue or network simulation.
a.timestamp < b.timestamp does NOT imply a→b; two concurrent events can have ordered timestamps, which is a fundamental limitation of Lamport clocks that vector clocks address.
totalorder() sorts by (timestamp, nodeid), using lexicographic node ID comparison as a deterministic tiebreaker when timestamps are equal.
totalorder() sorts events by (timestamp, nodeid), using the node identifier as a deterministic tiebreaker for concurrent events with equal timestamps.
Events with timestamp < watermark - allowedlateness are unconditionally dropped and counted in stats.lateevents_dropped; there is no secondary path to recover or re-buffer them
lcscompact performs at most one merge operation per invocation and returns immediately; cascading compactions across levels require the caller to loop on run_compaction().
L0 compaction always includes every L0 SSTable in the merge set (not just a subset), which can cause large write spikes when many L0 SSTables accumulate before the trigger fires.
Level 0 in leveled compaction triggers based on SSTable count (l0compactiontrigger), not total size, because L0 SSTables can have overlapping key ranges unlike higher levels which maintain non-overlapping invariants
Level size budgets grow as base_size × fanout^(level-1), so with defaults of 10MB base and 10× fanout: L1=10MB, L2=100MB, L3=1GB, etc.
Leveled compaction sizes each level as levelbasesize * fanout^(level-1) with a 10MB default base and 7 max levels, so each level is 10x the previous by default
During level N compaction, lcscompact calls _overlapping once per SSTable in the level to find the one with maximum overlap, then once more for the winner — making SSTable selection O(n*m) in level sizes.
For level N→N+1 compaction, the SSTable with the most overlapping neighbors in the next level is chosen — the opposite of LevelDB/RocksDB's least-overlap heuristic, increasing write amplification per compaction.
Leveled compaction merges all inputs into one SSTable rather than splitting output by size, which means levels above L0 will contain overlapping key ranges after compaction — violating the real LCS invariant that each level has non-overlapping SSTables.
The next_sibling leaf chain is correctly maintained through splits because the WAL logs the right page before the left page (ensuring the pointer target is durable before anything references it) and no concurrent reader can observe intermediate states
Each leaf entry costs 4 + len(keybytes) + len(valuebytes) bytes (2B key length + key + 2B value length + value), so maximum keys per page depends on actual data sizes, not just a fixed max_keys count.
Values are stored exclusively in leaf nodes; internal nodes contain only routing keys and child page pointers, making this a B+-tree not a B-tree
Every leaf page has a fixed 7-byte overhead (3B header via HEADERFMT = '>BH' + 4B nextsibling pointer) before any key-value entries, setting the usable capacity to page_size - 7 bytes.
Leaf next_sibling pointers serve range scans only, not Lehman & Yao concurrent-split recovery; internal nodes have no right-links, making this a standard B+ tree, not a B-link tree
When a leaf splits, the new page inherits the old page's nextsibling pointer and the old page's nextsibling is updated to point to the new page — two field writes during an operation that already rewrites both pages
Leaf pages are laid out as [type:1B][numkeys:2B][nextsibling:4B][entries...] where each entry is [keylen:2B][key][vallen:2B][val], all big-endian
The existing leaf next_sibling pointer is necessary but insufficient for Lehman-Yao concurrent access; the algorithm also requires right-links on internal nodes and a high key per node, neither of which this implementation has
After leveled compaction runs on L0 SSTables, the output readers have level == 1 set by the caller.
After leveled compaction, resulting SSTables are assigned level = maxlevel + 1 where maxlevel is the highest level among inputs, establishing the level hierarchy (observable in test assertion at test_sstable.py:119).
Leveled compaction triggers when the L0 SSTable count meets the l0compactiontrigger parameter
LinearizableRegister.read() broadcasts through the TOB consensus layer rather than reading local state, because a node cannot distinguish "I have the latest value" from "I'm partitioned and stale" without majority confirmation.
Live projections achieve exactly-once processing semantics through three reinforcing mechanisms: position-guarded deduplication prevents double-processing during overlapping catch-up and subscription, position advances even for events with no registered handler (preventing future re-evaluation), and the catch-up-then-subscribe pattern eliminates gaps between historical replay and live events.
A LiveProjection must replay all events from position 0 on initialization because there is no snapshot restore path wired into its construction
LiveProjection.onevent guards against double-processing via eventid <= position, making it safe to overlap catch_up() and live subscription without duplicate handler invocations.
LiveProjection processes events synchronously inside the append() call path via the _subscribers mechanism, meaning the caller blocks until all live projections have updated
Handler exceptions in LiveProjection.onevent propagate through EventStore.append() to the caller; the projection's position is not updated, leaving it behind but recoverable via catchup().
LiveProjection is never constructed with a snapshotinterval, so the automatic snapshot path in catchup() never triggers for live projections
Once constructed, a LiveProjection cannot be unsubscribed from its store; the callback reference persists in _subscribers indefinitely with no removal mechanism.
Events with no registered handler still advance LiveProjection.position, preventing re-evaluation on subsequent catchup() calls — position tracks last *seen* event, not last *handled* event.
LiveProjection._init registers the subscriber before the user calls catchup(), making duplicate processing possible but event loss impossible in the current single-threaded design
LiveProjection receives automatic updates by registering a callback in EventStore.subscribers (a list of Callable[[Event], None]]), which is invoked synchronously inside append() and appendbatch()
Both put and delete call recordseen with the local node's ID, ensuring the originating node drops its own changes when they loop back through the ring
LockService.acquire checks isexpired(currenttime) before rejecting a competing acquire, so a crashed or GC-paused client's lock is automatically reclaimable after TTL without requiring explicit release
A participant in the "prepared" state holds key-level locks (self.locks[key] = tx_id) that cause any subsequent transaction touching the same keys to abort with a lock conflict during its own prepare().
log-structured-hash-table/bitcask.py uses only flush() without os.fsync() on data writes, unlike hash-index-storage which calls os.fsync per record when sync_writes is enabled
In log-structured-hash-table/bitcask.py, compaction closes activefile during its write phase, making all reads and writes to the active segment impossible until the active file is reopened under a new name at the end of compaction
The log-structured Bitcask exposes createhintfiles() as a standalone operation callable independently of compaction, while the hash-index variant only writes hint files inside compact()
In log-structured-hash-table/bitcask.py, hint file creation requires an explicit createhintfiles() call rather than being produced automatically as a side-effect of compaction
Hint entry format uses !II (two u32 network-order integers), capping segment file offsets at 4 GiB regardless of maxsegmentsize configuration.
The log-structured-hash-table Bitcask variant stores only (filepath, offset) in its index, omitting record size and timestamp, requiring an extra header parse on every get() to discover record boundaries
log-structured-hash-table/bitcask.py uses TOMBSTONE = b"_BITCASKTOMBSTONE__" as an in-band sentinel in the value space; storing those exact bytes as a value causes silent data loss on recovery when the record is misinterpreted as a deletion
getnode performs a single bisect.bisect over ring_positions, so key lookup is O(log(N×V)) regardless of cluster size.
BitcaskStore.get() in log-structured-hash-table/bitcask.py returns raw payload bytes without checking for the TOMBSTONE sentinel, so an index entry pointing to a tombstone record surfaces b"_BITCASKTOMBSTONE__" as a value
The log-structured-hash-table hint entry format (!II = keysize + offset) has no field to distinguish live records from tombstones, making loadhintfile structurally unable to replicate scansegment's deletion logic
Neither log-structured-merge-tree/lsm.py nor sstable-and-compaction/sstable.py compute or verify any checksums; a single bit-flip in a length-prefix field causes cascading misframing of all subsequent records
The LSM tree's compact() method does not use os.rename or os.replace; grep for atomic rename operations returns zero matches in the LSM module, meaning the compaction output is written directly to its final path with no atomic swap.
The LSM Tree's compact() method removes tombstones because it performs a full merge of all SSTables, guaranteeing no surviving SSTable can contain a superseded live value
LSMTree.compact() runs synchronously in the caller's thread, meaning a put() or delete() that triggers the SSTable count threshold will block until the full k-way merge completes.
The LSM tree's compact() calls os.remove(sst.path) without an explicit fsync on the new SSTable or its parent directory beforehand, meaning the new file's data may not be durable when the old file is unlinked
A crash during LSM compaction produces duplicate entries (old and merged SSTables both present) rather than data loss, because newer SSTables take read precedence.
compact() merges all SSTables into a single new SSTable (size-tiered, single-level), not incremental or leveled — simpler but with higher space amplification.
Both LSM implementations resolve key conflicts during compaction by keeping only the newest value; older values are unconditionally discarded with no user-defined merge logic
LSM crash recovery is structurally impossible: the set of live SSTable files exists only in memory with no persistent manifest, so a crash permanently loses the SSTable list, and the WAL that could help reconstruct state has no integrity checking to validate its own records during replay.
testcrashrecovery in test_lsm.py only covers WAL replay for unflushed memtable entries; no test exercises crashes during SSTable compaction, leaving the most dangerous data-loss window (mid-compaction file deletion) untested.
LSMTree.compact() in lsm.py merges all SSTables into one without level hierarchy, making every compaction implicitly a bottommost-level operation where tombstone removal is always safe
The LSM _flush() truncates the WAL after writing an SSTable, but since neither the SSTable write nor the truncation is fsynced, a crash can lose both the WAL source data and the SSTable destination simultaneously — making the data irrecoverable
SSTable iterators (scan, scan_all) support only forward iteration; there is no Prev() or reverse scan capability, preventing ORDER BY DESC or backward cursor pagination
LSMTree.get() iterates through every SSTable in reverse sequence order and returns only after checking all of them when a key is absent; without bloom filters, missing-key lookups are O(N) in SSTable count
heapq is imported but never referenced in the code; compact() uses list sorting (sort by (key, -seq)) instead of a heap-based k-way merge.
flush line 307 frozen = self.memtable captures a Python reference to the SortedDict, not a deep copy, so concurrent mutation of the dict after the swap would be unsafe without synchronization
When the number of entries in the active memtable reaches memtable_threshold, the memtable is automatically frozen and flushed to a new SSTable on disk
The LSM tree scans every SSTable on negative lookups because the Bloom filter module — which is correctly implemented with textbook-optimal sizing — is never wired into the read path.
get() searches memtable, then immutable memtables, then SSTables in reverse-seq order (newest first), returning the first match; this guarantees newer writes shadow older ones without scanning all levels.
The LSM tree in lsm.py has no locking, reference counting, or Version snapshots; compact() mutates self._sstables in place, so concurrent readers could see inconsistent state or reference deleted SSTables
The LSM tree has no persistent metadata log (MANIFEST); the set of live SSTable files exists only in memory (self._sstables) and is not crash-recoverable independently of directory scanning
LSMTree in lsm.py has zero locking, atomic swaps, or synchronization primitives; all shared state (memtable, sstables) is mutated in-place without protection
The LSM-tree engine has no built-in byte or operation counters for writes; all write instrumentation (WAL appends, SSTable flushes, compaction output) must be added from scratch
lsm.py:range_scan loads all matching entries from every SSTable and memtable into a dict before returning any results, making first-result latency proportional to total result set size
The LSM tree's replay method (lsm.py:28) returns List[Tuple[str, bytes]], discarding WAL sequence numbers and making gap detection impossible at the LSM layer
The LSM tree uses a separate WAL (lsm.py:14-64) for crash recovery because its memtable is volatile; the hash index writes directly to the append-only data file, which serves as its own recovery log and needs no WAL.
The SSTable sparse index samples every 16th key by default; a point lookup binary-searches the sparse index then linear-scans up to 16 entries within the candidate block.
LSMTree.sstables is mutated by both flush() (append) and compact() (full replacement) with no synchronization, versioning, or ref counting, making concurrent reads unsafe if threading or async is added
The LSM test suite (testlsm.py) runs all operations sequentially with no concurrency, so race conditions between flush(), compact(), and range_scan() are never exercised
Deletion is represented as TOMBSTONE = b"" (empty bytes), which means empty-byte-string values and deleted keys are indistinguishable at the storage layer.
LSMTree in lsm.py maintains a flat list of SSTables ordered by sequence number with no level assignment; compaction merges all files into one rather than promoting between levels
Compaction uses heapq-based k-way merge with prevkey deduplication (lsm.py ~line 323), while rangescan uses a dict-based materialize-everything approach (lines 275–298) — two fundamentally different merge strategies in the same codebase
The LSM memtable uses sortedcontainers.SortedDict — the only external dependency across the four storage engine modules examined so far.
Every put() and delete() writes to the WAL before inserting into the memtable, ensuring no acknowledged write is lost on crash.
LSM WAL replay() silently discards any trailing partial record by checking remaining bytes at every parse step, making it tolerant of process crashes during append()
Each LSM WAL entry is exactly 8 + len(key_utf8) + len(value) bytes: two 4-byte big-endian length headers plus the raw payloads, with no checksum or operation-type field
The LSM tree's WAL (lsm.py:13-63) uses length-prefixed records with no CRC, relying solely on short-read detection for crash recovery; distinct from the separate lsm-wal-has-no-fsync issue
The LSM WAL (lsm.py:13-65) appends bare key-value pairs with no commit or transaction boundaries, safe only because structural changes (compaction, SSTable creation) use atomic file operations outside the WAL
The WAL class in log-structured-merge-tree/lsm.py calls flush() but never os.fsync(), making it strictly weaker than the standalone WAL which offers configurable sync modes
The WAL in log-structured-merge-tree/lsm.py uses only length-prefix framing with no CRC or checksum; a single bit flip in the length field causes silent data corruption or misaligned reads
The LSM WAL format has no checksums or magic bytes; it can detect truncation (incomplete length-prefixed records) but not corruption of existing bytes
The LSM tree's WAL (lsm.py:14-53) uses only length-prefixed framing with no checksum, making it unable to distinguish corruption from valid data unless a length field causes an out-of-bounds read
Constructing a new LSMTree on an existing directory replays the WAL to recover unflushed memtable state; this is validated by the crash recovery test which skips close() and reopens
The LSM tree's built-in WAL provides strictly weaker guarantees than the standalone WAL module along two independent axes: it has no CRC or checksum for integrity (vs per-record CRC32 in the standalone WAL), and its truncation zeroes the entire file instantly via wb mode (vs careful record-by-record rewriting), meaning corruption goes undetected and truncation is all-or-nothing rather than surgical.
The LSM tree's WAL.truncate() opens the file with "wb" mode which zeroes all content instantly, with no intermediate durable state to recover from if a crash occurs before the file is reopened for appending
The LSM tree at lsm.py:314 calls self._wal.truncate() with no sequence argument, indicating it uses a different WAL interface than the standalone write-ahead-log/wal.py module
If any open() or close() call within WAL.truncate() fails, the exception propagates uncaught and can leave self._fd holding a closed handle, breaking subsequent append() calls
WAL.truncate() uses a two-open sequence — open in "wb" mode to truncate the file, then reopen in "ab" mode for appending — because "wb" positions the cursor at offset 0 which is unsafe for a WAL
The WAL is truncated only after a successful SSTable write; if a crash occurs between SSTable.write and WAL.truncate, replay harmlessly re-inserts into the memtable (idempotent because it's a dict).
Sequential LWWRegister.set() calls without explicit timestamps produce monotonically increasing timestamps (auto-incremented), so ordering is preserved even without caller-supplied clocks.
When two LWWRegister writes have identical timestamps, the one with the lexicographically higher replica_id wins, making conflict resolution deterministic but arbitrary
LWWRegister.merge uses (timestamp, writer_id) tuple comparison, making conflict resolution a total order with no ambiguity since replica IDs are unique
On macOS/Darwin, os.fsync() does not guarantee flushing the disk write cache to stable storage; only fcntl(fd, F_FULLFSYNC) provides that guarantee, meaning all 13 fsync sites in the codebase may provide no power-loss durability on the development platform
On the development platform (Darwin/APFS), os.fsync() may not flush the disk write cache; true durability requires fcntl(fd, F_FULLFSYNC), which is never used at any of the 13 fsync call sites in the codebase.
On macOS (the development platform), os.fsync() may not flush the disk's hardware write cache; true power-loss durability requires fcntl(fd, F_FULLFSYNC) which is never used in the codebase.
The codebase uses os.fsync() exclusively, which on macOS/APFS does not flush the disk write cache; true durability requires fcntl(fd, F_FULLFSYNC) which is never used anywhere
When multiple records share the same join key value, all three strategies produce the cartesian product of the matching left and right groups
When left and right datasets share a non-key field name, mergerecords disambiguates with left and right prefixes; this convention applies in all join types including None-fill paths for left joins
Records missing the join key are silently dropped and counted in stats["skipped_records"] rather than raising exceptions, across all three join strategies
Mapper parallelism is simulated via round-robin chunking and mapperid tagging on output records; no threads or processes are used
The module implements exactly three join strategies (broadcast hash, partitioned hash, sort-merge) that all produce identical inner-join results when verified by comparejoinstrategies
All three map-side join strategies attach a mapperid field to each output record, enabling verification that partitions/mappers operated independently
Applying a combiner never changes final results; it only reduces intermediate record volume — the combiner must be semantically invisible
MapReduce in mapreduce.py writes intermediate data to JSON partition files between map and reduce phases, creating a full materialization barrier that prevents streaming from mapper to reducer
MapReduceJob.run() produces identical output regardless of nummappers/numreducers configuration for the same input data — the shuffle/partition step is deterministic
MapReduceJob.run() accepts either a list of (key, value) tuples or a file path string as input, with the file path handled transparently
job.stats tracks mapinputrecords, mapoutputrecords, reduceoutputrecords, worker counts, and elapsed time after each run
MapReduceJob defaults to strict mode where mapper/reducer exceptions propagate to the caller; fault_tolerant=True must be explicitly set to enable silent error skipping
Unbounded in-memory materialization is a cross-cutting concern from storage through batch processing: storage operations (LSM range scans, compaction, SSTable index rebuilds) materialize entire datasets, and pipeline stages (Count) silently accumulate all input before producing output, meaning the materialization problem exists at every layer of the data processing stack.
The maxkeysperpage parameter bounds entries per node so that serialized output fits within pagesize, with splits triggered before the limit is exceeded
mayberotate must be called under self.lock but does not acquire or assert the lock itself; all three call sites (append, appendbatch, checkpoint) satisfy this obligation.
The gossip membership lifecycle correctly detects and removes crashed hint-holding nodes, but this membership-layer correctness masks data-layer failure: hints stored on the crashed node are permanently lost with no recovery mechanism or error indication, and the cluster reports the failure as handled because the membership event was processed successfully.
The membership subsystem is unreliable at two independent levels that interact destructively: failure detection is both the single correctness bottleneck AND permanently miscalibrated (no arrival-rate history for adaptive thresholds), while membership and data convergence operate at fundamentally different rates (O(log N) membership gossip vs O(N) data propagation in ring topology), meaning incorrect liveness decisions propagate through membership faster than data can adjust to them.
The memtable_threshold parameter (lsm.py:202) directly caps the maximum number of WAL entries that must be replayed on crash recovery, because the WAL is truncated on every flush.
merge_sstables never deletes or modifies input SSTable files; cleanup is the caller's responsibility, and the current CompactionManager callers also do not delete old files from disk.
Every CRDT merge() method mutates the receiver in place and returns self, enabling chaining but meaning callers must deepcopy() before merging if the original state must be preserved
mergesstables writes directly to outputpath with no temp-file-and-rename pattern, so a crash mid-merge leaves a corrupt partial file at the target path.
mergesstables writes tombstones (value=None) to the output unless removetombstones=True is explicitly passed, and no current call site in CompactionManager enables removal.
mergesstables() accepts a removetombstones boolean from the caller rather than computing safety from level metadata, so correctness depends entirely on the caller passing the right value
Two trees with different leafcount but the same paddedsize can be diffed; positions where one tree has data and the other has EMPTY_HASH padding are reported as diffs
MerkleTree.diff compares two trees in O(log n) by short-circuiting at matching subtree hashes, providing the divergence-detection layer that would feed receivereplica with only changed keys rather than a full key scan
Merkle tree diffs produce sorted leaf indices that map directly to key ranges when the tree is built with key-sorted input, enabling anti-entropy repair that transfers only divergent key ranges rather than full dataset comparisons.
diff() excludes padding indices beyond max(self.leafcount, other.leafcount), using max (not min) so positions where one tree has data and the other has EMPTY_HASH padding are still reported as diffs
diff() and diffrecursive are read-only — they mutate neither tree, allocate only a local result list, and produce no side effects
Differing leaf indices returned by diff() are in ascending order as a consequence of left-to-right DFS traversal in diffrecursive
MerkleTree.diffrecursive() returns immediately when subtree hashes match, making the diff cost proportional to the number of divergent keys rather than total tree size — the logarithmic efficiency that makes anti-entropy practical.
MerkleTree.diff() returns a list of integer leaf indices where the two trees diverge, not subtree nodes or hash pairs
getproof bounds-checks against leafcount (real leaves), not padded_size, so proofs cannot be generated for padding slots even though they have valid hashes in the backing array
Internal node hashes are computed over concatenated 64-char hex digest strings (128 ASCII bytes) rather than raw 32-byte SHA-256 digests, resulting in 4x more data per hash input.
The direction field in each proof sibling tuple indicates where the sibling sits (left or right), not the current node — this controls hash concatenation order during verification
Merkle proofs are point-in-time snapshots; a proof generated before update_leaf will fail verification against the new root hash, and nothing invalidates the stale proof in-place
A proof for a tree with paddedsize = 2^h contains exactly h sibling entries, one per tree level excluding the root
getproof returns siblings ordered from leaf level to root level (bottom-up), and verifyproof consumes them in that same order
A Merkle proof contains exactly height sibling hashes where height = log2(nextpowerof2(leafcount)), making both proof size and verification cost O(log N).
The tree uses 0-indexed implicit array layout where children of node i are at 2i+1 and 2i+2, and leaves start at index padded_size - 1
The Merkle tree implementation avoids serialization canonicalization entirely by accepting caller-provided bytes and hashing them directly via hashlib.sha256(data), making it immune to the JSON encoding ambiguity that affects the PBFT digest.
A MerkleTree with 4 leaves has height == 2, indicating height counts internal levels (log₂ of padded leaf count)
The implementation handles non-power-of-2 leaf counts (tested with 1 and 3 leaves) via internal padding to the next power of 2 using EMPTY_HASH
The Merkle tree uses SHA-256 (merkletree.py:11) for tamper-evident cross-replica verification via verifyproof, while all storage engines use CRC32 — reflecting the split between accidental-corruption detection on trusted local disk and adversarial-integrity across trust boundaries
The metadata payload is exactly 20 bytes (5 big-endian uint32s packed as '>5I'), zero-padded to pagesize; the format requires pagesize >= 20 but this is not enforced.
The B-tree metadata (root, height, totalkeys, nextfree, freehead) is always stored at file offset 0 in page 0, which is reserved as the superblock and occupies exactly pagesize bytes.
The test testvnodecountaffectsbalance asserts that imbalance at 500 vnodes is strictly less than at 1 vnode, and the statistical model (variance ∝ 1/V) guarantees monotonic improvement.
The combiner runs inside runmapper on each mapper's output for a single partition independently; it never sees data from other mappers or other partitions.
When fault_tolerant=True, mapper and reducer exceptions are caught and silently skipped with no logging or error reporting, producing partial results without any indication of data loss.
runreducer sorts all intermediate pairs by key before calling itertools.groupby, which is required because groupby only groups consecutive elements with the same key.
MapReduceJob assigns intermediate keys to reducer partitions using hash(key) % num_reducers, ensuring all values for a given key reach the same reducer.
Intermediate shuffle data is serialized as JSON files in a temp directory with naming convention map-{mapper_id}-part-{partition}.json.
Conflicts are recorded in conflict_log with metadata including the key and the ConflictStrategy enum value that resolved them.
ConflictRecord stores both localvalue and remotevalue along with the key and resolution strategy, providing a complete audit trail for every conflict resolution.
allconverged() compares (value, timestamp, istombstone) but intentionally ignores originnodeid (tuple index 2), since different arrival orders can record different origins for the same resolved state
Custom merge functions are applied once per conflict; repeated sync rounds on already-converged values must not re-trigger the merge, preventing bugs like counters doubling on every sync cycle.
Custom merge resolution creates a new timestamp (max(localts, remotets) + 1) and canonical origin so the merged result supersedes both conflicting inputs in any future LWW comparison
Constructing a cluster with CUSTOMMERGE strategy and mergefn=None is accepted silently, but raises TypeError at the first actual conflict during sync()
Each (key, timestamp, originnode) triple is applied at most once per node, tracked by the seen set, preventing duplicate application in ring topologies where changes propagate through multiple hops
Successive put() calls on the same ReplicaNode yield strictly increasing Lamport timestamps.
LWW conflict resolution compares (timestamp, node_id) tuples using Python's lexicographic tuple comparison, guaranteeing all nodes independently reach the same winner without coordination
Multi-leader replication stores tombstones indefinitely with no compaction or garbage collection method, which is safe but accumulates dead entries without bound
RING topology requires at least N−1 sync() rounds to fully propagate a single-source change across N nodes, because each round advances the change by exactly one hop
Ring topology requires at least 2 sync rounds to propagate a write across 3+ nodes, unlike all-to-all topology which converges in a single round; syncuntilconverged() loops to handle this.
sync() drains all pending queues from every node before distributing any changes, preventing intra-round cascading where one node's change would trigger further propagation within the same sync round
The count returned by sync() includes idempotent no-op deliveries (changes already in the target's _seen set), so it reflects replication attempts, not accepted changes
Multi-leader sync is designed for safe convergence: the two-phase collect-then-distribute pattern prevents intra-round cascading, custom merge functions are required to be idempotent across repeated syncs, and merged results receive fresh timestamps ensuring they supersede both inputs.
Replication occurs only when sync() is explicitly called; there is no background replication thread or event-driven propagation, giving tests full control over replication timing.
Deletes are implemented as tombstone writes (TOMBSTONE sentinel with istombstone=True) that replicate like normal mutations; get() returns None for tombstoned keys
The B-tree, garbage collector, and consistent hash ring all assume single-threaded access without locks, assertions, or documentation — a systematic pattern where concurrency unsafety is implicit rather than enforced.
MVCCDatabase.begintransaction captures activeat_start as a frozen set of all currently active transaction IDs at that moment; it is never modified after creation and serves as the transaction's immutable snapshot boundary
Active (uncommitted) transactions cannot survive a process restart; the safe recovery strategy is to treat them as aborted, which is consistent with isvisible's existing handling of aborted transaction versions as invisible
Writers never modify existing Version objects (except self-overwrites); new values are appended to per-key version lists, preserving the immutable snapshot for concurrent readers
nexttxid and nexttimestamp must never reissue a previously used value; if they reset to 1 on restart, visibility comparisons like createdby < tx.tx_id silently produce wrong results by conflating old and new transaction IDs
MVCCDatabase has zero serialization or persistence support; all state (versions, transactions, committed, aborted, counters) lives only in memory with no todict, fromdict, or file I/O of any kind
The codebase implements MVCC visibility (ssidatabase.py:visiblevalue, mvccdatabase.py:Version) at the row/key level but not at the SSTable-file level where LevelDB uses Version reference counting for concurrent reader isolation
MVCC visibility is enforced through three complementary invariants: append-only version storage prevents lost updates, own-writes are always visible regardless of commit state, and deletions follow the same visibility rules as writes — collectively ensuring snapshot consistency.
The visibility check createdby < tx.txid assumes transaction IDs are monotonically increasing to mean "started before us"; non-monotonic or reused IDs would break snapshot correctness.
In MVCCDatabase, uncommitted transaction writes exist only as in-memory Version objects in versions and never reach a persistent data store; abort works by adding the txid to _aborted so visibility checks filter it out
isvisible requires committed, aborted, and each transaction's activeatstart set; persisting version chains without this metadata makes them unreadable after restore since visibility cannot be determined
The tester/test split uses at least three naming patterns across directories: testertest*.py, test*tester.py, and testtester*.py
Neither Bitcask implementation supports per-key TTL or expiry; hash-index-storage stores a timestamp in the header but never checks it for expiration, and log-structured-hash-table omits timestamps entirely.
Neither partitioning strategy offers both reliable routing and ordered access: range partitioning preserves key ordering but depends on maintaining parallel arrays in lockstep with invariants that are never verified (boundaries/partitions correspondence, adjacency assumptions in merge), while hash partitioning provides deterministic single-lookup routing but permanently destroys lexicographic key order, making range queries impossible.
Network partitions create a force multiplier on distributed write correctness gaps: partitions disrupt gossip-based failure detection (the single correctness gate for replication, read repair, and hinted handoff), while writes that do proceed operate under already-weakened quorum semantics (sloppy quorums count hints as successes, sub-quorum configurations accepted with only a warning).
The nextfreepage metadata field only ever increases — it is the high-water mark of file extension, never decremented even when pages are freed; file size never shrinks, only interior pages are recycled via the free list.
No SSTable writer or compaction routine in any implementation uses the write-temp/fsync/rename pattern; all write directly to the final file path
No CI pipeline (no .github/ directory), Makefile, or orchestration script exists to invoke testertest*.py files; they are run manually via python testertest*.py
None of the three storage engines (hash-index, log-structured-hash-table, LSM) use a manifest or compaction log to make segment replacement atomic; segment discovery is purely filesystem-based via directory listing.
Neither LSM implementation throttles compaction I/O throughput; a large merge will saturate disk bandwidth without regard to concurrent foreground reads or writes.
Every binary record reader in this codebase (WAL, both Bitcasks, LSM WAL, SSTable, B-tree WAL) stops at the first CRC failure, short read, or parse error; none attempt to scan forward for the next valid record boundary
The entire B-tree implementation contains zero locking, latching, mutex, or thread-safety constructs; it is strictly single-threaded with no concurrent-access support
Neither truncate nor compactpartition holds a lock; concurrent calls on the same partition would race on both partitions and baseoffsets, assuming single-threaded execution.
The ddia-implementations repo contains zero conftest.py files at any directory level — confirmed by exhaustive grep across all 37 modules, not just root
The standalone WAL module provides no mechanism or documentation for enforcing the required WAL-before-data fsync ordering; callers must implement the fsync(wal) → write(data) → fsync(data) protocol themselves.
The repository contains no cuckoo filter implementation; the counting Bloom filter is the only deletion-supporting probabilistic filter, leaving the saturation problem unsolved
The repository has no cuckoo filter, quotient filter, or xor filter implementation; CountingBloomFilter is the only probabilistic membership structure supporting deletion.
No data path through the system is trustworthy: the primary storage path through compaction is unsalvageable (crash-unsafe within a broken durability pipeline, no concurrent access protection) while the secondary derived-data path through event infrastructure is unreliable in both content semantics and ordering guarantees, leaving no channel through which data can be read or propagated with confidence.
The system has no defense-in-depth against data corruption: input validation is systematically absent at API boundaries (malformed data enters freely), and corruption is terminal across all readers with no resync capability (corrupted data can never be recovered) — the system is maximally permeable to corruption entry and maximally fragile to corruption presence.
The BTree class never calls pm.writepage directly for data pages; all data page writes are routed through walwritepage (or walwrite_meta for page 0) to maintain the write-ahead invariant.
No implementation in the codebase calls os.fsync() on a directory file descriptor; all fsync calls target data file descriptors only
On-disk files created by storage-engine modules (B-tree pages, WAL segments, SSTable files) are not covered by .gitignore, relying on test teardown or manual cleanup
Calling freepage twice on the same page creates a cycle in the free list, which allocatepage cannot detect, leading to silent data corruption through repeated page reuse.
Both SSTable implementations write variable-length records contiguously with no fixed-size block structure, making mid-file resynchronization after corruption impossible — a corrupted length field causes the reader to consume all subsequent valid records as garbage.
No node stores a high-key (upper-bound fence key); searchers cannot detect mid-split state, which would be required for Lehman & Yao's concurrent search protocol
Nodes store no upper-bound "high key," so a reader cannot detect that a concurrent split moved keys to a right sibling; this is the second missing prerequisite for Lehman-Yao's single-latch protocol
Neither the Raft nor TOB implementations include leader lease or ReadIndex optimizations; all read and write operations pay full consensus cost.
The B-tree has no allocation bitmap, page-reachability check, or post-recovery consistency scan to detect leaked or orphaned pages; once a page is lost between the tree and the free list, it is silently unrecoverable.
No file in the ddia-implementations codebase uses ODIRECT, os.open(), or any os.O flags; all I/O goes through Python's buffered open() and the kernel page cache
There is no operational path to system correctness: the system has no stable regime at any scale (storage degrades monotonically, failures widen the correctness gap, no paradigm is both scalable and self-healing), AND even if a stable state existed, correctness could not be verified (the gap between specification and implementation widens under every failure mode and testing validates the wrong model).
Neither serializeinternal nor serializeleaf checks whether the serialized output fits within pagesize; overflow prevention is the caller's responsibility via the maxkeys limit in BTree._insert.
The repository contains no pyproject.toml, pytest.ini, setup.cfg, or root conftest.py; pytest runs with pure default settings
No file in the repository implements reference counting on SSTable files, Version objects, or any storage-layer snapshot structure
The system has no safe operating mode in any scenario: it is unsafe under concurrent access (both read and write paths lack synchronization across core components) AND unsafe during crash recovery (no storage engine has a fully safe recovery path), meaning correctness is unachievable whether the system is running normally under load or recovering from failure.
The batch pipeline performs no validation that adjacent stages produce/consume compatible record formats; mismatched tuple shapes fail at runtime with indexing errors
The codebase contains no copy-on-write or shadow paging implementation; all three storage engines use write-ahead logging exclusively for crash recovery
NO_SIBLING = 0xFFFFFFFF marks the end of the leaf sibling chain; a leaf with this value is the rightmost at its level
No storage paradigm offers both scalability and structural resilience: hash indexes are memory-bound, LSM miss-probes are linear due to missing Bloom integration, and all paradigms degrade monotonically after crashes with no safe recovery path (crash-unsafe compaction, incomplete CRC, replay without batch atomicity).
The codebase has no ring, star, or mesh topology modes for gossip; peer selection is always uniform random from all active nodes.
No subsystem at any architectural tier has a viable recovery strategy: storage-layer recovery is paradoxically over-engineered in infrastructure and under-implemented in usage (WAL builds complete but unused sequence/checkpoint machinery, with no safe crash recovery path), while application-layer event sourcing lacks durable persistence and uses conflated ID spaces — recovery fails both where infrastructure exists but is unused and where it was never built.
The partitioned log provides no mechanism to atomically commit consumer offsets together with processing side effects, ruling out exactly-once delivery without external idempotency
The entire repository uses scalar Lamport clocks for ordering; no vector clock or version vector implementation exists, so true concurrency detection (distinguishing causal order from concurrent writes) is absent
In none sync mode, dosync() is a complete no-op for individual appends; data reaches stable storage only via WAL rotation or explicit close
The B-tree is a classical single-threaded B⁺-tree, not a Lehman-Yao B-link tree — internal nodes lack right-link pointers, there are no high keys, and no latch coupling; leaf sibling pointers exist solely for range scan traversal.
Consumer tracks offsets as a dict[tuple[str, int], int] mapping (topic, partition) to offset, so commit granularity is per-partition, not per-message
Non-primary nodes do not update currentview in handleviewchange; they wait for the NEW_VIEW message to adopt the new view, leaving them in a liminal state that rejects normal-phase messages for both old and new views.
Every DocumentPartitionedDB and TermPartitionedDB operation returns an OperationResult with a partitions_touched field, making read/write amplification cost observable and directly comparable between strategies.
Ordering and position infrastructure is broken or volatile at every layer: WAL sequence numbers are diligently computed and stored but never consulted during recovery, event sourcing conflates stream-scoped and global position IDs creating addressing ambiguity, and CDC consumer positions exist only in memory and are lost on restart — no ordering mechanism in the codebase reliably serves its intended purpose.
The codebase uses three incompatible time/ordering models with no unifying framework: Lamport clocks provide total order via node-ID tiebreaking, vector clocks provide partial order with explicit concurrency detection, and wall-clock timestamps provide no causal guarantees and are not even monotonic.
Concurrent add and remove of the same element in ORSet resolves in favor of the add, because the add's unique tag was not in the remover's tombstone snapshot
The ORSet implementation uses no version vector, dot context, or causal summary to compress tombstone state; each removed tag is stored individually as a (replica_id, seq) tuple in the tombstone set
ORSet.remove() returns False for a non-existent element rather than raising an exception, treating removal of absent elements as a no-op with a boolean status indicator.
ORSet tombstones create an inherent resource leak: they must be retained indefinitely for merge correctness (removing them causes deleted elements to reappear when merging with replicas that still hold the original tags), yet they grow monotonically with no compaction or garbage collection, meaning memory consumption increases without bound over the set's lifetime proportional to total remove operations.
Dropping ORSet tombstones would cause removed elements to reappear when merging with a replica that still holds the element's active tags — the tombstone set in merge() at line 165 suppresses stale tags via set difference
ORSet._tombstones is append-only: tags are added during remove() and unioned during merge() but never deleted, so the tombstone set grows without bound over the lifetime of the replica
ORSet._tombstones is only ever unioned during merge() and added to during remove(); there is no compaction or garbage collection, so the tombstone set grows without bound
The merge set for leveled compaction is always exactly one source SSTable plus all SSTables in the target level whose key range intersects it, constructed via _overlapping() at sstable.py:428-429.
overlapping performs no I/O, mutation, or side effects — it operates entirely on in-memory metadata (minkey, max_key) cached during SSTableReader construction.
overlapping determines SSTable key-range overlap using the standard interval overlap test (a.minkey <= b.maxkey AND a.maxkey >= b.min_key) with lexicographic string comparison matching the SSTable sort order.
A transaction always sees versions it created itself (createdby == tx.txid), regardless of commit status, ensuring read-your-own-writes consistency within a transaction.
Non-power-of-2 leaf counts are padded with EMPTY_HASH (SHA-256 of empty bytes) to form a complete binary tree; these padding nodes appear in proof paths but do not represent real data.
PageManager defaults to 4096-byte pages matching the OS page size and common disk sector size, but this is a coincidental match with O_DIRECT alignment requirements — the implementation uses buffered I/O through Python's open() and never enforces alignment
PageManager calls flush() on every writepage and write_meta call but only calls os.fsync() in sync() and close(), so individual page writes are not crash-durable without the WAL layer
readmeta() does not increment pagesread, so metadata access is invisible to the I/O statistics reported by BTree.stats
B-tree metadata has a durability gap at both lifecycle points: initial creation writes metadata directly without WAL protection, and all subsequent updates flush but never fsync, meaning metadata can be lost or inconsistent after any crash regardless of when it occurred.
The page size is not persisted in the data file; reopening with a different page_size value silently corrupts all page reads with no error or detection mechanism
walwritepage pads data to exactly pagesize with null bytes before passing it to both wal.logwrite and pm.writepage, ensuring the WAL record and data file page are byte-identical.
PageManager defaults to 4096-byte pages, matching typical filesystem block size; this sets the hard upper bound for serialized leaf and internal node content.
Page 0 is always the metadata page and must never be written directly — only through writemeta; user data pages start at page 1, with the initial empty leaf root always at page 1
loadhint_file in both Bitcask implementations treats a short read as end-of-file (breaks out of the read loop without error), silently dropping all keys that would have appeared after the truncation point
Participant.recover() returns a list of in-doubt transaction IDs but cannot commit or abort them; only the coordinator holds the decision, so participants remain locked until Coordinator.recover() re-sends it.
Network partitions create cascading failures across the membership and consensus layers: gossip-based failure detection is the single correctness gate for replication, and Raft's partition hazard (stale leader silently accepting doomed writes, forced re-election on rejoin) depends on membership accuracy that gossip's timeout-based suspicion may misjudge.
PartitionedHashJoin.join() calls partition_dataset() on both inputs before joining, meaning partitions are computed at join time rather than assumed from storage layout
Log compaction retains only the last occurrence per key but always preserves messages with key=None.
Messages with the same key always route to the same partition via MD5(key) % num_partitions, guaranteeing per-key ordering within a partition.
Offsets within a partition are strictly increasing and never reused; baseoffsets only advances forward via truncate and compact_partition.
Topic._init_ enforces partition count between 1 and 128, raising ValueError outside that range.
When persistdir is set, messages are appended as JSONL files named {topic}{partition}.jsonl and committed offsets are written to a single offsets.json.
ConsumerGroup.rebalance() is synchronous and triggered on every membership change, distributing partitions round-robin across sorted consumer IDs — no incremental or cooperative protocol.
All three storage engine WAL implementations share a systematic integrity blind spot: CRC checksums cover only data payloads, leaving routing metadata (sequence numbers, page numbers) unprotected against silent corruption.
PBFTCluster raises ValueError if n is not exactly 3f+1 or if more than f byzantine nodes are specified, enforcing the minimum redundancy required for Byzantine fault tolerance.
checkprepared logs the node's own COMMIT into messagelog and phasesenders before returning it for broadcast, so duplicate detection correctly rejects re-delivery of the node's own COMMIT.
The default=str parameter in the PBFT digest function silently converts non-serializable objects to their str() representation, which is not deterministic across replicas for types like datetime or custom objects — a latent consensus-breaking bug if non-primitive request payloads are introduced.
PBFTNode.receive_message returns an empty list for duplicate or invalid messages (out-of-range sender IDs, already-seen tuples) rather than raising exceptions, consistent with Byzantine protocol design where you can't trust the sender.
Honest nodes execute requests with contiguous sequence numbers starting from 1, enforced by the protocol's ordering guarantees and verified by test assertions on the executedlog.
A (view, seq) pair can transition to "prepared" at most once per node, enforced by the preparedrequests set guard at the top of check_prepared.
checkprepared always calls checkcommitted after marking prepared, enabling immediate commit if enough COMMIT messages arrived out of order before the node reached the prepared state.
checkprepared requires exactly 2f matching PREPAREs (not 2f+1) because the primary's PRE-PREPARE counts as implicit agreement toward the quorum of 2f+1.
The primary sends PRE-PREPARE but not PREPARE; handleprepare rejects PREPAREs from the primary, and the primary's code path in handlepre_prepare skips sending a PREPARE — this asymmetry is why the prepared threshold is 2f not 2f+1.
byzantine-fault-tolerance/pbft.py:46 is the only call site in the codebase that uses json.dumps(sort_keys=True) for hashing; all other json.dumps calls are for storage serialization where canonical form is irrelevant.
PBFTCluster.submit_request runs the full three-phase protocol to completion deterministically in a single call (no real networking or async), returning a boolean success indicator.
Requests that reach the prepared state survive view changes: the new primary collects prepared-but-uncommitted requests from VIEWCHANGE messages and re-proposes them with fresh PREPREPARE messages in the new view.
PBFT view changes maintain both safety (requiring 2f+1 messages before acting) and liveness (carrying prepared-but-uncommitted requests into the new view), matching the theoretical protocol guarantees.
Every accepted remote change (including synthesized merge results) is appended to _pending for downstream propagation, enabling multi-hop replication in ring topologies where a single sync round cannot reach all nodes.
Every accepted remote change in applyremotechange is appended to _pending, causing it to be forwarded to the next node in the replication topology during the next sync cycle
Independent per-record CRCs in wal.py, btree.py, and bitcask.py validate each record in isolation, so record deletion, reordering, or splicing produces a file where every CRC passes but the log is semantically corrupt
The event is appended to self.events (in-memory) before persist_event writes it to disk; a disk write failure leaves the in-memory store ahead of the on-disk log with no rollback.
Disk persistence via persistevent happens before subscriber notification: an event is written to the NDJSON file before any LiveProjection or other subscriber sees it.
persistevent writes to the OS buffer via file.write() but never calls fsync; event data can be lost on crash between write and OS flush to disk.
The persist file is opened in append mode and closed on every single persistevent call, not held open across the store's lifetime — safe but incurs per-event open/close overhead.
Pipeline execution is pull-based via nested Python generators; records flow only when the terminal consumer iterates, and stages interleave execution without threads
Pipeline stages pass untyped tuples with no validation that the output shape of one stage matches the input expectations of the next; a mismatched stage composition fails at runtime, not at construction time
Pipeline stages produce and consume lists of (key, value) tuples, where key is a string and value depends on the stage (string for Tokenize, int for Count).
Pipeline.run() materializes all output records into a list, while run_lazy() returns an iterator that yields records on demand without full materialization.
Pipeline stages have fundamentally different resource profiles hidden behind a uniform tuple interface: Count is an unbounded memory barrier that materializes all input before emitting any output, Sort bounds memory via external merge-sort disk spillover, yet both produce and consume identical tuple lists with no way for the pipeline to distinguish streaming stages from blocking ones.
PNCounter state is structured as two GCounter sub-counters (p for increments, n for decrements) with value() = p - n, confirming the standard decomposition from CRDT literature.
PNCounter delegates entirely to two GCounter instances (p for increments, n for decrements); it contains no independent merge or counting logic
Every get must descend to a leaf node regardless of tree height since internal nodes hold no values; cost is exactly *h* page reads for a tree of height *h*
A crash during compaction produces irrecoverable data loss: no implementation uses write-temp/fsync/rename for atomicity (Bitcask deletes before renaming, LSM lacks atomic rename), and the resulting file corruption is terminal because every reader halts at the first CRC failure with no resync or skip capability.
PostgreSQL's nbtree never merges or redistributes siblings on delete; it marks tuples dead and relies on VACUUM to reclaim space, trading immediate space efficiency for throughput under concurrency
If the predicate function throws on a (key, value) pair in read_predicate, that pair is silently excluded from results via bare except Exception: pass — no logging, no error propagation
get(pk) reads directly from partitions[pid].documents and is never affected by asyncindex mode — staleness only applies to querybyfield which reads from globalindex
Partitioned log Producer hashes keyed messages to a fixed partition for ordering guarantees, and round-robins unkeyed messages across partitions for load distribution
catchup advances position for every event, not just those with registered handlers, so unhandled event types don't create replay gaps
catchup uses fromposition=self.position + 1 arithmetic that assumes eventid values are contiguous integers with no gaps; IDs with gaps would cause events to be silently skipped
If a handler raises during catchup, position is not advanced past the failing event, causing all subsequent catch_up calls to re-encounter and re-fail on the same event indefinitely
A Projection's _state is a cache of derived data, not a source of truth — it can be fully reconstructed from the event log at any time by replaying from position 0
The snapshot interval counter increments for every event processed during catch_up, including those without handlers, so snapshot frequency is based on total event throughput not just handled events
Distributed protocol safety claims are unfalsifiable under the current testing methodology: specific protocol gaps (2PC's blocking window with unused timeout, Raft's stale-leader writes) AND general protocol behavior are validated exclusively under synchronous simulation with deterministic delivery — safety violations that manifest only under asynchronous or partitioned delivery are structurally invisible to the test suite.
Distributed protocol safety properties are validated exclusively under synchronous simulation (deterministic tick-based delivery, no real network I/O), but the most critical failure mode — network partitions creating stale-leader write acceptance and forced re-elections — is inherently asynchronous, creating an untested gap between modeled and real-world safety.
put never modifies existing on-disk data; overwrites create a new record appended to the active segment and update only the in-memory index, leaving old records as reclaimable garbage until compaction
put does not reject the TOMBSTONE sentinel as a value; passing it creates a record indistinguishable from a delete, which silently corrupts the key's state on recovery
ReadRepairStore never checks the boolean return from Replica.put() during read repair or writes, relying on version monotonicity to make stale or redundant writes harmless no-ops.
The test*.py files are consistently longer than their testertest_*.py counterparts (e.g., 207 vs 125 lines for bloom-filter), containing additional edge case tests beyond the tester's core invariant checks
Tests must be executed from within each module's directory (e.g., cd bloom-filter && pytest) because imports use bare module names with no package prefix or installed package
Each BitcaskStore instance independently rebuilds the full in-memory index from disk; there is no shared-memory mechanism to preserve or share the keydir across instances even within the same process, unlike Erlang's ETS-backed keydir
Neither Python Bitcask implementation has reference counting, reader registration, or file-handle locking — compaction can delete segment files while a concurrent reader holds a stale index entry, unlike Erlang Bitcask which defers deletion until the last reader's refcount drops to zero.
The distributed storage layer systematically weakens quorum semantics: sloppy quorums count hint storage as successful writes, and sub-quorum configurations (R+W<=N) produce only warnings rather than errors, making it possible to operate without intersection guarantees.
When R + W <= N, the ReadRepairStore constructor emits a warnings.warn rather than raising an exception, allowing intentionally relaxed quorum configurations.
advancecommitindex is invoked both on heartbeat ticks and immediately after successful append responses update match_index, allowing eager commit advancement without waiting for the next heartbeat cycle.
RaftCluster uses tick-based deterministic simulation with no real clocks, threads, or network I/O; time advances only via explicit tick(ms) calls, making partition and election scenarios fully reproducible.
commitindex only ever increases within advancecommitindex because the loop starts at commit_index + 1 and only assigns forward; it can never decrease or be reset.
An entry is committed when count > total // 2, meaning for a 5-node cluster at least 3 replicas (including the leader) must have the entry, matching Raft's quorum requirement.
advancecommitindex only commits entries whose term matches current_term, implementing the Raft safety property from §5.4.2 that prevents committing prior-term entries by replica count alone.
RaftNode.client_request() on a follower returns {"success": False, "error": "not leader"} as a structured dict rather than raising an exception; only the leader accepts client writes.
The Raft implementation in raft-consensus/raft.py embodies the Multi-Paxos leader-lease pattern: one election per term via becomeleader, then Phase-2-only replication (AppendEntries) for all entries within that term — heartbeats act as the lease
Log entries are accessed by log index as a direct Python list index (self.log[prevlog_index]), which assumes all entries from index 0 onward are present — log truncation would require an offset or a different data structure
self._log in RaftNode is append-only: no method ever removes entries, so the log grows monotonically with client requests and memory usage is unbounded
Peers missing from matchindex are treated as having replicated nothing via dict.get(peer, 0); becomeleader initializes all peers to 0, so this default is safe but allows graceful handling of unexpected state.
A partition containing fewer than a strict majority of nodes cannot elect a leader; get_leader() returns None for a 2-of-5 minority partition.
client_request returns {"success": False, "error": "not leader"} if the node is not the leader; there is no automatic forwarding to the current leader.
The Raft implementation has no PreVote, leader lease, or CheckQuorum mechanism; split-brain safety relies entirely on term numbers and the single-vote-per-term invariant
lastapplied is tracked but never used to actually apply commands to a state machine, making snapshotting impossible without first implementing application logic
Network partitions create a compound safety hazard in Raft: the isolated leader silently accepts writes that can never commit, and when it rejoins its inflated term forces a disruptive re-election of the healthy partition's leader.
Network partitions are simulated by adding node IDs to _partitioned; partitioned nodes neither tick nor send/receive messages, modeling complete network isolation.
The only mechanism preventing simultaneous candidacy is the randomized election timeout range (default 150–300ms); there is no protocol-level tiebreaker such as PreVote
A partitioned node that rejoins after repeated failed elections carries an inflated term number, forcing the healthy leader to step down via becomefollower and triggering a completely unnecessary cluster-wide election
The Raft log is initialized with a sentinel LogEntry(term=0, index=0, command=None) at position 0, and multiple methods depend on this entry always being present at self._log[0]
The log is initialized with a sentinel LogEntry(term=0, index=0) at position 0 that is never removed, eliminating empty-log edge cases in lastlogindex() and lastlogterm().
RaftCluster.tick() collects outbound messages and delivers both the request and its response within the same tick invocation — there is no simulated network latency.
A partitioned leader continues accepting client requests (returning success: True) even though those entries can never commit and will be overwritten when the partition heals — the gap a leader lease would close
Each new leader election produces a strictly higher term than the previous leader; terms never repeat or decrease across leader transitions, as validated by the testtermsmonotonically_increasing test.
Uncommitted log entries from a deposed leader are overwritten when the node receives AppendEntries from a new leader with conflicting entries at the same index; only committed entries are durable.
mergesmallpartitions() only merges adjacent partitions when their combined size stays below minpartitionsize, preventing immediate re-splitting after merge.
mergesmallpartitions sweeps left-to-right and greedily merges adjacent pairs whose combined size is at or below minpartitionsize, which can produce different results than an optimal merge strategy
RangePartitionedStore uses return-value signaling: get() returns None for missing keys and delete() returns False for absent keys, rather than raising exceptions.
Range partitioning has its own skew risk: keys arriving in sorted order all hit the rightmost partition; Partition.split() mitigates this for data already stored by splitting at the median, but cannot prevent temporary hotspots during sequential ingestion
Partition.split always divides at the median key (index len // 2), which can produce uneven partitions when key distribution is skewed
Splits are triggered automatically during put() when a partition exceeds maxpartitionsize, but merges require an explicit call to mergesmallpartitions() and never happen implicitly.
Key routing uses bisect_right(boundaries, key) - 1 on a parallel sorted list of partition start keys, giving O(log p) partition lookup where p is the partition count.
Partitions use half-open intervals [startkey, endkey) with the first partition starting at "" and the last having end_key=None, guaranteeing the entire string keyspace is covered with no gaps or overlaps.
Partition.merge(other) appends the other partition's keys without re-sorting; correctness depends on the caller passing the immediate right neighbor, with no runtime adjacency validation.
No concurrency protection exists on any mutation path (put, delete, split, merge); concurrent access would corrupt the sorted key/value lists.
partitions and boundaries are kept in lockstep as parallel arrays — index i in both refers to the same partition — trading O(n) insert/delete for O(1) indexed access and O(log n) binary search routing.
Range partitioning routing depends on maintaining strict lockstep invariants across parallel data structures that are never verified: boundaries and partitions are kept as parallel arrays where any desynchronization silently misroutes keys, merge operations assume partition adjacency without checking, and splits always divide at the median regardless of key distribution skew.
Partition.split() divides at len(keys)//2 (count-based median), not the lexicographic midpoint of the key range, producing equal-count halves but potentially unequal key ranges.
rangescan(startkey, endkey) treats endkey as an exclusive upper bound — range_scan("k05", "k10") returns keys k05 through k09
rangescan walks the leaf-level linked list via nextsibling pointers rather than re-descending the tree, giving O(height + leaf pages in range) page reads.
BTree.rangescan(startkey, endkey) returns keys in the half-open interval [startkey, endkey) — start-inclusive, end-exclusive; endkey=None scans to the last key.
Range scans are vulnerable to three independent failure modes: concurrent modifications produce inconsistent results (no snapshot isolation), corrupted sibling pointers can cause infinite traversal (no cycle guard), and large result sets consume unbounded memory (eager materialization into a list).
range_scan iterates sources oldest-to-newest (SSTables, then immutable memtables, then active memtable) so that dict overwrites naturally give each key the value from the newest source
range_scan collects all matching (key, value) pairs into a list in memory rather than yielding lazily; large ranges consume proportional memory.
rangescan() combines results from both the active memtable and all SSTables in self.sstables, making it sensitive to mutations of either data structure during iteration
The range scan merge uses an incrementing priority counter where higher priority = newer source; the dict naturally deduplicates by overwriting older entries for the same key
range_scan can observe inconsistent state if a flush or compaction runs concurrently, since there is no snapshot isolation or locking
rangescan has no visited-set or cycle detection on the leaf sibling chain; a corrupted nextsibling pointer would cause an infinite loop.
Partition.rangescan is a pure read operation with no side effects — it never modifies keys, _values, or any other state, and returns a fresh list the caller can mutate freely.
Passing end=None to Partition.rangescan sets the right boundary to len(self.keys), scanning from start through the last key with no upper bound.
The priority integer stored in the merged dict during range_scan is never read or compared; correctness relies entirely on the source iteration order and dict overwrite semantics, making the priority field dead weight
range_scan has no side effects: it does not mutate tree state, trigger flushes, or modify the WAL
rangescan correctness depends on keys being sorted, an invariant maintained by put/delete/split/merge but not enforced structurally — direct mutation of _keys bypassing these methods silently produces wrong results.
Calling rangescan(startkey) without an endkey scans from startkey through the last key in the rightmost leaf, terminating at the NO_SIBLING sentinel
_reaches uses is (object identity) to find the source event, not ==, because dataclass equality would conflate distinct events with identical field values.
The causal graph has two edge types: parent (same-node sequential) and cause (cross-node send→receive), and _reaches must follow both to correctly determine happens-before.
Both get() (line 254) and rangescan() (line 286) check immutable_memtables in the correct order, indicating the read path was designed for a concurrent-flush architecture that the write path never implements
The complete read path from storage through derived systems is unreliable at every stage: the SSTable layer has compounding integrity and performance deficiencies (no per-entry checksums, linear miss-probes from missing Bloom integration, redundant key-list rebuilds per lookup) while the CDC backbone feeding all derived systems depends on reconstruction heuristics for event semantics and requires callers to manually flush and provide old values for consistency.
read_predicate records dependency graph edges only for keys in the committed snapshot (k in snap), not for keys the transaction itself created via write() — avoiding self-dependencies
readpredicate appends a (predicate, dict(result)) tuple to tx.predicate_locks — a deep copy so commit() can compare against the original result even after database state changes
DynamoCluster.get() read repair pushes the highest-versioned value to stale replicas, so a naive delete (removing from _store) is undone when a recovering node still holds the old value
Read repair in get() only fixes replicas that were part of the read quorum, not all replicas; full-cluster repair requires a separate call to antientropyrepair().
Topic.read finds the start position by scanning for msg.offset >= offset rather than computing an array index, which is necessary because compaction creates offset holes that break arithmetic indexing.
ReadLines accepts either a list[str] of in-memory lines or a file path string as its input source, providing two input modes from the same stage.
LinearizableRegister routes both reads and writes through the same TOB slot mechanism, giving them positions in the same total order — no read-only fast path exists.
Once a node's local status is dead, receive_gossip will never change it back to alive, even if the remote has a higher heartbeat counter — preventing flapping after failure declaration
receivegossip deep-copies remote entries before inserting them into local membership, extending the isolation guarantee beyond the sendgossip/join/getmembershiplist paths covered by gossip-deep-copy-isolation
Unlike detectfailures which skips the node's own ID, receivegossip has no self-exclusion guard — if the remote gossip contains an entry for the receiver, it is merged normally
receivereplica in vector-clocks/vector_clock.py is the receiving side of anti-entropy replication; nothing in the codebase calls it automatically — tests simulate the call manually, and no gossip or sync module invokes it
reconcile unconditionally replaces all siblings for a key with a single merged version, regardless of whether the provided contexts cover every existing sibling.
reconstruct_state never modifies the EventStore; it is a read-only fold over a single stream's events that returns a locally-constructed dict
reconstructstate always replays from the beginning of the stream via readstream with default from_version=0; it does not leverage snapshots, unlike Projection
The upto parameter in reconstructstate filters on global event_id, not stream-local version number, despite operating on a single stream
Each WAL record starts with a 4-byte little-endian length prefix (not counting itself), followed by CRC, seqnum, optype, and length-prefixed key/value pairs — minimum 25 bytes per record for empty key/value
Corruption of recordlength causes read_record to consume wrong bytes for the payload, which fails the CRC check indirectly — making explicit CRC coverage of the framing field unnecessary
recoverseq_num finds the max sequence number but does not detect or report gaps in the sequence space, meaning lost records between rotations or corruption are silently ignored.
recoverseq_num returns the maximum sequence number across all WAL files, not just the latest file, ensuring correctness even when records are spread across rotated segments.
recoverseqnum treats CRC errors as file-local (abandons the corrupted file, continues scanning subsequent files), unlike readallrecords which stops all scanning on corruption.
A single restart simultaneously destroys both durability and isolation guarantees along independent axes: the WAL's carefully calibrated write-time durability tiers (per-write fsync vs batch-only fsync) become invisible to recovery, and the transaction isolation model's composed invariant layers (MVCC visibility plus SSI conflict detection) break because abort metadata and monotonic counters are non-persistent.
Coordinator.recover() checks is_available() before re-sending decisions and skips unavailable participants, meaning a double failure (coordinator crash + participant crash) leaves locks held until both are up and recovery re-runs.
All three storage implementations (WAL, hash-index Bitcask, log-structured Bitcask) scan every segment file on startup to rebuild state, making recovery time proportional to segment count rather than data volume.
Recovery is paradoxically both over-engineered and under-implemented: the WAL builds complete sequence number and checkpoint infrastructure that's never consulted during replay, while the actual recovery paths lack batch atomicity and metadata CRC protection — engineering effort was invested in infrastructure that goes unused while the active recovery path remains unsafe.
None of the DDIA reference implementations combine tombstone support with time-bounded garbage collection; the multi-leader module stores tombstones indefinitely and the Dynamo module has no delete support at all.
Both Bitcask compaction paths (hash-index-storage/bitcask.py:297, log-structured-merge-tree/bitcask.py:301) perform os.rename() without a subsequent directory fsync, making the rename non-durable on ext4 and XFS
LockService.renew() extends the lock TTL by mutating issued_at and ttl without changing the FencingToken.token value, while re-acquire() by the same client issues a new higher token
Despite the docstring claiming replay "skips uncommitted batches," the implementation returns all PUT/DELETE records regardless of whether a matching COMMIT record exists — a crash mid-batch will replay partial batch records
replay flushes the write file descriptor under self._lock before reading, ensuring all prior append calls are visible on disk during the read phase
Both the unbundled WAL and B-tree WAL replay all CRC-valid records without verifying batch completeness, meaning partial batches from mid-write crashes are silently applied as if they were complete transactions.
The file read phase of replay runs without holding self._lock, so concurrent append calls during replay could produce a partial read of in-flight writes
replay returns an in-memory List[WALRecord] snapshot, not a lazy iterator — the full result set is materialized before returning, which can be large if the WAL is large
replay(afterseq=n) returns records with seqnum strictly greater than n (exclusive lower bound), not greater-or-equal — passing 0 replays the entire log
Replica.get does not consult self.available; filtering unavailable replicas is the caller's responsibility via ReadRepairStore.available_replicas().
Replica.get performs no mutations, availability checks, or I/O — it is a pure dictionary lookup, unlike ReadRepairStore.get which triggers read repair on stale replicas.
Replica.get returns a (value, version) tuple when the key exists, or None when it doesn't — callers must guard against None before destructuring, as there is no sentinel tuple.
The version check and store update in Replica.put are not atomic (TOCTOU race on lines 22–25); the method is unsafe under concurrent access without external locking.
Replica.put() silently rejects writes with a version strictly less than the stored version (returns False), enforcing monotonic version advancement; equal versions are accepted as last-writer-wins.
Replica.put does not generate version numbers; version assignment is the responsibility of ReadRepairStore.put, which reads the max version across all replicas (including unavailable ones) before writing.
Versions assigned by ReadRepairStore.put start at 1 (computed as max_version + 1 with initial max of 0), so a version of 0 in ReadRepairStore.get return values means "key not found," not "initial version."
When the new primary merges prepared sets from VIEW_CHANGE messages, it deduplicates by sequence number using first-writer-wins without verifying that all nodes prepared the same request at that sequence — relying on the PBFT safety proof that at most one request can be prepared per (view, seq).
writerfieldmap is constructed but never referenced in resolverecord, making it dead code likely left from a refactor
Default values in resolverecord are inserted by reference without deep copying, so mutable defaults (lists, dicts) would be shared across all decoded records that use the default
Writer fields not present in the reader schema are fully consumed from the buffer via _skip() and discarded, ensuring the byte stream stays correctly positioned for subsequent fields
resolverecord uses a two-pass algorithm: pass 1 consumes all writer fields from the buffer in wire order (decoding or skipping each), pass 2 assembles the result dict in reader field order from decoded values and defaults
Each resource in FencedResourceServer maintains its own highesttoken entry, so a high token on resource A does not cause rejection of a lower token on resource B.
applyremotechange appends accepted changes to the receiving node's _pending list, enabling store-and-forward propagation in ring topology; without this re-enqueue, changes would stop at the first hop.
RING topology advances changes by exactly one hop per sync() round due to snapshot-isolated pending queue draining, requiring N-1 rounds for a single-source change to reach all N nodes.
Ring topology requires O(N) sync rounds for full convergence because each round advances changes by exactly one hop via store-and-forward requeuing, unlike fully-connected topologies that converge in one round.
A 3-node ring topology requires at least 2 sync() rounds to achieve full convergence, unlike a fully-connected topology which converges in 1 round.
The WAL _rotate() method unconditionally calls flush() and os.fsync() before closing the current segment, providing a durability checkpoint even in none mode
After mayberotate returns, self.fd is guaranteed non-None and open for writing (assuming it was non-None on entry), because rotate() always opens a new file.
No test in testbloomfilter.py exercises the counter saturation boundary (16+ collisions at a single position) to verify deletion correctness under overflow
savesnapshot uses copy.deepcopy to isolate the stored snapshot from subsequent mutations to state
Each new slice in ScalableBloomFilter uses FPR of p * (ratio ^ slice_index), so the aggregate false positive rate stays bounded even as the filter grows unboundedly
If readentry encounters truncated data mid-range, scanrangeforkey silently returns (False, None) rather than raising an error, making corruption indistinguishable from a missing key
scanrangeforkey assumes entries between start and end are sorted by key ascending; it breaks early on k > key, so unsorted data causes silent false negatives
scansegment must be called on segments in ascending segment-ID order for the index to correctly reflect last-write-wins semantics; no runtime check enforces this ordering — it is the caller's responsibility.
SchemaRegistry is a pure in-memory dict with monotonic ID assignment, with no persistence, HTTP API, or subject-based grouping
Schema registration accepts any schema without checking compatibility against previously registered versions under the same subject, unlike Confluent which enforces backward/forward/full compatibility on registration
The Document dataclass exists as a domain concept but the DB classes accept pk and fields separately and never consume Document instances.
get() on a nonexistent key returns OperationResult(data=None) and querybyfield() for absent values returns empty results rather than raising exceptions; documents missing indexed fields are stored but not indexed.
Both DocumentPartitionedDB and TermPartitionedDB remove stale index entries before writing new ones during put and delete, preventing phantom results from old field values.
seen maps each key to a set of (timestamp, originnode_id) tuples; a change is skipped if its exact (ts, origin) pair is already in the set for that key
Both Bitcask implementations rotate to new segment files at a configurable size threshold, bounding the maximum data loss from stop-at-first-error to the records trailing the corruption within a single segment
A bit flip in a WAL record's seq_num passes CRC validation and is silently accepted with the wrong sequence number, potentially corrupting replay filtering, truncation boundaries, and recovery sequence numbering
Adding seqnum to the CRC input requires changing two lines (encoderecord and read_record) with no measurable runtime overhead, since CRC32 over 9 extra bytes is negligible
serializeleaf packs all entries into a buffer without comparing its length to page_size; overflow prevention is entirely the caller's responsibility
serializeleaf is a pure function with no side effects; all I/O happens in the caller via walwrite_page
Proof siblings are tagged "left" or "right" because SHA-256 concatenation is order-dependent — H(A||B) != H(B||A) — so swapping sibling position produces a different parent hash and verification fails.
Leaf pages carry a nextsibling field in their serialized format (NOSIBLING = 0xFFFFFFFF sentinel), but no tree operation uses sibling pointers for traversal or merging — it is scaffolding, not a live data structure. Note: may refine existing btree-leaf-sibling-chain.
Sibling pointers exist only on leaf nodes (serializeleaf at line 182); internal nodes have no right-links, making this a B+-tree, not a B-link tree
VersionedKVStore.put never drops an existing version unless the new version's clock dominates it; concurrent versions are preserved as siblings.
A bit-flip in a value region of hash-index-storage/bitcask.py causes get() to return corrupted data with no error; the only partial integrity check is the key-equality assertion
Silent data loss is the default operational mode: corruption propagates through every data transformation pipeline without detection (compaction copies without CRC validation, hint generation skips integrity checks), and the testing methodology cannot observe durability failures (no crash tests, no fsync verification), meaning data degrades continuously with no feedback signal to operators.
SILENT Byzantine mode returns an empty list for all outbound messages, but the node still processes incoming messages and updates internal state — simulating a crash fault rather than a protocol violation
PageManager stores metadata (page 0) and data pages in a single file, so one os.fsync() call in sync() covers both; splitting into separate files would require multiple fsyncs with ordering constraints in the commit fence
BTree.put raises ValueError for any single key-value pair that cannot fit in a page, blocking the case where even one entry would overflow
BTree has zero synchronization primitives; the sibling chain, split logic, and page allocation are correct only because no concurrent reader or writer can observe intermediate states
In the synchronous single-threaded LSM execution model, the missing immutable memtable lifecycle step causes no observable data loss because _flush() completes atomically within the caller's execution flow
Size-tiered compaction triggers when the total SSTable count meets the min_threshold parameter
snapshot only includes committed versions (committs <= snapshotts); all three call sites manually overlay tx.writes and tx._deletes when they need the transaction-local view
Each projection's snapshot is stored under self.name in store.snapshots, so two projections with the same name on the same store will overwrite each other
When event sourcing snapshots are missing, the system falls back to replaying all events from the log to reconstruct aggregate state — correct but O(n) in total event count
SSIDatabase.snapshot returns a new dict each call; callers in readpredicate and commit mutate it freely (overlaying writes, removing deletes) without corrupting the MVCC store
Snapshot data (store.snapshots) contains only state and position; subscriber callbacks registered in EventStore._subscribers are not captured, so restoring a snapshot does not restore live-update behavior
During commit validation, snapshot(self.nexttimestamp) captures all committed versions because nexttimestamp is strictly greater than any existing committs
Projection snapshots are stored in store._snapshots (an in-memory dict), not persisted to disk, so they are lost on process restart
The snapshots attribute is not declared on EventStore; it is lazily created by Projection.savesnapshot via hasattr/setattr on the store instance
Sort stage spills sorted chunks to temp JSONL files when the in-memory buffer exceeds memory_limit, then uses heapq.merge with a KeyedRecord wrapper (seq tiebreaker) for stable k-way merge
SortMergeJoin inspects whether input is already sorted and reports this via stats["sorted_input"], allowing it to skip the sort step on pre-sorted data
A corrupted data entry in an SSTable only affects lookups within one sparse index segment (between two adjacent index offsets, default 16 entries); keys indexed by other segments remain correctly accessible
Election messages from demoted leaders are delivered one level deep (election message + immediate response), not recursively to convergence; the next tick() finishes any remaining propagation.
Only nodes where is_available() is True are considered when detecting and resolving split-brain; crashed nodes' stale leader state is ignored.
Elections triggered by resolvesplitbrain do not call recordleader, so leadership changes from split-brain resolution are absent from electionhistory.
resolvesplitbrain executes only after the tick's while allmessages delivery loop has fully drained, never mid-delivery.
During a leaf split, the new right page is written with the old leaf's next_sibling before the old leaf is rewritten to point to the new page; WAL ordering ensures the pointer target is durable before anything references it
SSIDatabase.commit() returns a structured dict {"committed": bool, "reason": str|None, "conflicts": list} rather than raising, letting callers inspect what conflicted for retry logic
SSIDatabase adds a dependencygraph and predicatelocks on top of MVCC to detect read-write conflicts (write skew) that basic snapshot isolation permits
SSIDatabase.read must check tx.writes and tx.deletes before consulting _store, since a transaction's own writes aren't materialized into the shared store until commit — unlike MVCCDatabase where eager writes make own-writes visible automatically
When pessimistic is true, checkreadconflict raises RuntimeError and sets tx._status = "aborted" at read time rather than waiting for commit — eager detection vs optimistic validation
Phantom detection re-evaluates each stored predicate function against current committed state at commit time, comparing both key sets and values to the snapshot-time result
Read-only transactions (empty write set) bypass all conflict detection in commit() and always commit successfully — they cannot cause write skew because they don't write
SSI's commit path relies on three mutually reinforcing properties: read-only transactions skip validation (safe because the store contains only committed data from prior commits), and within a single transaction, writes and deletes for the same key are mutually exclusive, preventing conflicting modifications to that key within the transaction's own write set.
SSIDatabase.store only contains versions stamped with commit timestamps; uncommitted writes live in tx.writes until commit, so every version in the store is guaranteed to be from a committed transaction
SSIDatabase.visiblevalue determines visibility by a single comparison (committs <= snapshotts) with no active-set tracking, because the deferred-write model guarantees all stored versions are committed
write() discards any pending delete() for the same key and vice versa; a key is in tx.writes XOR tx.deletes, never both simultaneously
SSTableWriter's block_size parameter controls the sparse index frequency (one index entry per N records), not a physical byte-aligned block size; records are written contiguously with no padding or alignment to fixed byte boundaries.
CompactionManager.getsstables() returns list(self.sstables) (a shallow copy), preventing iterator invalidation from concurrent mutation but not preventing file-deletion races on the underlying SSTable files
The CompactionManager tracks maxlevels (default 7) but has no logic to check whether lower levels contain overlapping key ranges before deciding tombstone removal, unlike LevelDB/RocksDB which use VersionSet metadata for this
A crash between SSTableWriter creation and finish() leaves entry_count=0 in the file header while data records exist on disk; SSTableReader trusts the header count and reads the file as empty, silently losing all written entries.
The SSTable format supports exactly two entry types (value and tombstone via TOMBSTONE_MARKER); there is no third type for merge operands, which would be required for RocksDB-style merge support
The SSTable format has no integrity or efficiency safeguards: entries carry no CRC or checksum, range scans bypass the sparse index and scan from file start, and the entry count header is trusted without verification — meaning corruption goes undetected, negative lookups are unnecessarily expensive, and truncated files produce silent data loss.
SSTableReader.get() returns None for keys not found in the table, rather than raising an exception
SSTable.get extracts sparse index keys into a fresh list on every call rather than caching them, making each lookup O(m) in index size before the binary search begins
For SSTables loaded from disk, loadindex() must be called before get() or every lookup silently returns (False, None) — the method does not call loadindex() itself
The SSTable format (sstable.py:10-11) uses a 4-byte magic b"SSTB" and a 2-byte version field, making it the only binary format in the repo with format versioning
SSTableEntry in sstable.py carries a timestamp field, but lsm.py's merge and conflict resolution uses only structural SSTable ordering (sequence number priority), not per-entry timestamps
The SSTable layer has mutually reinforcing integrity and performance deficiencies: no per-entry CRC or checksum means corrupted entries return wrong values silently, and no Bloom filter integration forces every negative lookup to scan all SSTables sequentially, maximizing both the probability and frequency of encountering those silent corruptions.
Point lookups in both SSTable implementations first binary-search the sparse index to identify a block, then linearly scan entries within that block's byte range; total cost is O(log(N/B) + B)
The 4-byte magic b"SSTB" at offset 0 is the only data integrity check in the sstable-and-compaction SSTable read path; it gates file-type validation but provides no corruption detection for keys, values, timestamps, or structural offsets
merge_sstables() takes SSTableReader instances directly as inputs, using a streaming iterator interface rather than loading entire files into memory
When multiple SSTables contain the same key, merge_sstables keeps only the entry with the highest timestamp via heap ordering (key, -timestamp), silently discarding all older versions.
The k-way merge in sstable-and-compaction/sstable.py is forward-only with no direction tracking or reverse iteration support; there is no direction enum, no FindLargest, and no heap rebuild on direction change.
SSTableMetadata in sstable-and-compaction/sstable.py tracks a level field, indicating the compaction module is structured for leveled compaction even though lsm.py uses flat single-level merges
SSTableMetadata.level in sstable-and-compaction/sstable.py exists as a field but has no persistent backing; level assignments would be lost on restart without a MANIFEST equivalent
The SSTable format has no CRC or checksum on entries or blocks; the only validation is a magic-byte assertion (MAGIC == b"SSTB") on file open.
range_scan(start, end) is inclusive on start and exclusive on end.
SSTableReader.range_scan(start, end) scans from the beginning of the file rather than using the sparse index to skip ahead, making it O(N) regardless of range position.
SSTable.scan() and scan_all() use Python generators (yield), meaning individual SSTables already support streaming — only the cross-SSTable merge layer materializes
readentry() in the LSM SSTable checks for short reads on key and value fields, preventing buffer overruns on truncated files but silently skipping all remaining entries in the scan range
SSTableWriter.add() does not enforce sorted key order; violation silently corrupts binary search in SSTableReader.get().
A corrupted SSTable data entry only affects lookups within one sparse index segment; the sparse index provides independent entry points into different file regions, bounding corruption blast radius to ~block_size entries
SSTableReader loads a sparse index into memory that records every block_size-th key and byte offset; point lookups binary-search the index then linearly scan within the block.
Both SSTable implementations store the sparse index at the end of the file with a footer pointer, meaning bloom filters could be co-located in the same footer region without changing the file layout
test_sstable.py is a linear script using bare assert statements and print progress markers, not a pytest/unittest suite — it runs top-to-bottom in a tempfile.TemporaryDirectory context.
Tombstones are encoded on disk as a single 0xFF byte in the value position; this is unambiguous because valid value-length fields are 4 bytes big-endian and cannot start with 0xFF at realistic value sizes.
Tombstones are represented at the API level as SSTableEntry objects with value=None, not as a separate deletion record type.
The SSTable's 12-byte trailer (footer_start offset + entry count) is an unprotected single point of failure; its corruption makes the entire file's sparse index and data entries unreachable with no fallback discovery mechanism
The SSTable write path has two independent fragility points: sorted key order is the caller's responsibility with no enforcement (violation silently corrupts binary search), and the file header is written as a placeholder then patched via seek-back after all entries are written (a crash between data write and header patch leaves a structurally invalid file).
SSTableWriter writes a placeholder header, appends all entries and the sparse index, then seeks back to byte 0 to patch the final entry count — an unfinished SSTable has entry_count=0 and appears empty.
sstable-and-compaction/sstable.py:SSTableWriter.finish() closes the file without calling flush() or os.fsync(), relying on implicit Python close-time buffer flush with no disk durability guarantee.
The SSTableWriter.add() → finish() lifecycle is the natural integration point for per-SSTable bloom filter construction, since keys are already iterated in sorted order during the write pass
SSTableWriter.finish() at sstable-and-compaction/sstable.py:91 calls close() without prior fsync(), leaving newly flushed SSTables vulnerable to loss if the OS page cache hasn't been written to disk
SSTableWriter binds to one file descriptor at construction with no method to rotate to a new file mid-write, making it structurally incompatible with size-bounded output splitting
Stage.trackedprocess subtracts time spent after each yield (downstream processing time) to attribute wall-clock time accurately to each individual stage
Standard bloom filters lack deletion support, but SSTables are write-once-then-discarded, so deletion is never needed and the 8x space savings over counting filters is free
stcscompact both removes old and appends new entries to self._sstables as a side effect; callers must not hold references into the list across a compaction call.
Size-tiered runcompaction() processes every bucket that meets minthreshold in a single call, while leveled compaction processes at most one level per call.
Size-tiered compaction only merges SSTables in the same size tier (determined by gettier() comparing filesize against size_thresholds); it never merges a small SSTable directly with a much larger one.
A size tier must accumulate at least 4 SSTables (the default minthreshold) before stcscompact merges them; buckets below this threshold are skipped entirely.
Size-tiered compaction allows multiple SSTables at the same tier to contain overlapping key ranges, requiring point lookups to check all SSTables in a tier
No storage engine has a fully safe crash recovery path: compaction lacks atomicity, WAL replay ignores batch boundaries, and CRC checksums leave routing metadata unprotected — corruption can enter via unprotected metadata, persist through non-validating compaction, and survive recovery via batch-unaware replay.
Storage engines degrade monotonically during normal operation (leaked pages, growing height, no rebalancing) and have no safe recovery after crashes (non-atomic compaction, partial batch replay, unchecked metadata) — there is no path from degraded state back to healthy state.
Three core storage operations materialize entire datasets in memory with no streaming, pagination, or backpressure: LSM range scans load all matching entries into a dict before returning, compaction buffers all merged entries into a list before writing, and SSTable lookups rebuild the sparse index key list on every call — creating O(n) memory pressure proportional to total data size rather than result size.
RangePartitionedStore.rangescan passes the same global startkey/endkey to every overlapping partition rather than clamping to partition boundaries, relying on bisectleft to naturally exclude out-of-range keys.
ConflictStrategy implements LWW and custom merge but not CRDTs; the CRDT module exists separately in crdts.py with no integration into the multi-leader replication strategy dispatch
CompactionManager compares strategy with == against "size_tiered" — any other string silently selects leveled compaction with no validation or error.
StreamJoinProcessor expires buffered events when they fall below watermark - window_duration, bounding memory at the cost of potentially dropping late-arriving matches
The processor actively expires events so buffer sizes remain proportional to windowduration × eventrate, not total events processed; tests confirm buffers stay under 10 per side after 1000 events with a 5-second window.
Stream join correctness depends on consistent alignment across three independent window mechanisms: join and aggregation windows are configured independently with no constraint preventing misalignment, expiration uses a one-sided cutoff that doesn't match the symmetric matching predicate, and buffer size is bounded only by the window duration relative to watermark advancement rate.
get_results() swaps out and resets the accumulated results list; callers see only results since the previous drain (pull-based consumption model)
Inner join only produces a JoinResult when both key equality and |tleft - tright| <= window_duration hold; either condition failing alone prevents a match.
Events arriving after watermark - allowedlateness are dropped and counted in stats.lateevents_dropped; the processor never raises exceptions for late arrivals.
JoinType.LEFT emits misses only for unmatched left-side events; unmatched right-side events are silently dropped without any miss emission
Outer-join misses are only emitted at event expiration time (when timestamp falls below watermark minus window duration), never eagerly on arrival
The processor does not validate that an event's streamname matches either configured stream; an unknown stream name is silently treated as the right stream via is_left returning False
A single event can match multiple events on the opposite side; the matched flag is set on all participants, so none produce outer-join misses
process_event() returns matches synchronously on each call (push-based) rather than batching to a flush boundary; results are available the instant a match is found.
The join processor's watermark only advances forward via max(current, event.timestamp); advance_time() silently ignores timestamps at or below the current watermark
streamversion() returns the count of events in a stream (not the latest eventid), and expected_version checks against this count for optimistic concurrency
Event subscriber notification in EventStore blocks the append() caller — projections update on the writer's thread before the append call returns, trading throughput for zero-gap consistency
CRDTReplicaGroup.sync_all runs exactly 2 rounds of all-pairs sync, which is sufficient for convergence of any state-based CRDT with a commutative/associative/idempotent merge
When syncmode="none", the per-write do_sync() call (default force=False) is a complete no-op — no fsync and no batch queue — trading durability for maximum write throughput
The system converges on permanent dark failure: there is no operational path to correctness at any scale AND failure is irreversible and undetectable, meaning the system silently accumulates data loss with no diagnostic signal, operational remedy, or evolutionary escape — even the decision to rebuild requires evidence the system cannot produce.
The system degrades monotonically at every abstraction level with no equilibrium: storage engines erode structurally with no self-healing (leaked pages, growing height, no rebalancing, unsafe recovery), and the distributed layer's compensating mechanisms (anti-entropy, read repair) cannot fully resolve the resulting replica divergence, which accumulates without bound.
The system's natural trajectory is toward permanent undetectable data loss: corruption propagates silently through every pipeline without detection, AND the structural degradation that enables it is irreversible at every abstraction level — failure is invisible while it happens and unfixable after it's discovered.
The system has no stable operational regime at any scale: storage degrades monotonically during normal operation with no self-healing, failures widen correctness gaps at every abstraction level, and no storage paradigm can escape by scaling (hash indexes hit memory walls, LSM hits probe walls) — the architecture converges toward failure under every operating condition.
No layer in the system can reverse a completed operation: the WAL stores only new values with no before-images (making rollback structurally impossible), and transaction abort is a status-flag change that leaves all written versions on disk, meaning neither the storage layer nor the logical transaction layer supports undo.
The system is trapped in a verification-repair deadlock: verification is impossible at every architectural layer (integrity checks degrade along the pipeline, protocol safety is unfalsifiable under current testing) AND the rigid binary format design across the storage stack prevents adding verification or self-healing capabilities through evolution, foreclosing both detection and correction of faults.
TermPartitionedDB.querybyfield looks up a single index partition by term hash then fetches documents from their home partitions, touching 1 + K partitions (K = distinct data partitions) instead of scattering to all.
TermPartitionedDB.put touches 1 + N partitions where N depends on how many distinct index partitions the document's field values hash to, because index entries live on term-hashed partitions separate from the data partition.
Tests universally use block_size=4 while SSTableWriter defaults to 64 and lsm.py defaults to 16; the small test value forces multi-block index paths with small datasets
The tester and developer test suites have overlapping but non-hierarchical coverage: tester files uniquely test spec-example compliance and cross-path equivalence, while dev files uniquely test crash recovery and internal state
The testertest*.py files are generated by the code-expert workflow's tester stage from implementation specs, distinct from hand-written test_*.py pytest files
The testertest*.py files contain no auto-generation markers and are hand-authored standalone test suites, not generated by code-expert or any other pipeline — contradicts existing tester-files-are-generated-artifacts
Every testertest*.py file contains an if _name == "main_": block, making it executable via python without pytest
Tester test files import only the public API (e.g., from btree import BTree), while developer test files import internal types like WAL, serializeleaf, and HEADER_FMT for state injection and internal invariant checking
Tester files print "test_name PASSED" to stdout, indicating an output-parsing runner rather than pytest's exit-code-based reporting
The testertest*.py prefix does not match pytest's default collection patterns (test*.py / *test.py), so these files are excluded from pytest runs unless explicitly specified
Tester files embed runtime metrics in their pass/fail output (tree height, page counts, pass/fail ratios) that pytest's structured reporting does not surface without the -s flag
At least 6 testertest*.py files import pytest despite being designed for standalone execution, indicating the two suites share lineage rather than being fully independent
TestSpecExample tests in tester files replay the exact spec usage example verbatim as a living executable spec, a form of spec-drift detection that developer test suites do not replicate
Tester files decompose monolithic pytest tests into smaller focused functions with docstrings (e.g., testbasic becomes testbasicputget + testrangescan + test_persistence)
Some testertest*.py files are pytest-only (using fixtures/pytest.raises), while others include an if _name == 'main_': block for standalone execution; the two execution patterns coexist across modules
No collectignore, norecursedirs, testpaths, or pythonfiles directive exists anywhere in the repo, so testertest*.py files will be collected by pytest's default *_test.py glob
There is no shared conftest.py, base class, or test harness across testertest*.py files; each imports only its own module under test plus standard library utilities and is fully self-contained
Each testertest*.py is a simplified subset of its corresponding test_*.py, organized by numbered behavioral properties (# 1. No false negatives, # 2. FPR within 2x of target) rather than exhaustive edge cases
testertest*.py files contain no auto-generation markers (generated, DO NOT EDIT) and no regeneration tooling exists in the documented code-expert workflow; they are static committed files
testertest*.py files test the public API contract of each implementation, importing only public symbols (plus format constants for fault injection), acting as external conformance harnesses
The testertest*.py files consistently contain more test functions than their test_*.py counterparts (bloom-filter: 11 vs 5, gossip-protocol: 12 vs 10), and include tests for features not covered in the default-discovered files
The testing methodology covers neither single-node crash failures nor distributed asynchronous failures: crash recovery paths are systematically untested (no torn-write tests, no compaction crash tests), and distributed protocols are validated only under synchronous deterministic simulation with no real network I/O — the entire failure surface area from storage crashes to network partitions is invisible to the test suite.
When a TOB slot is decided by another proposer with a different value, the losing node re-proposes its value for a new slot rather than retrying the same slot with a higher round
ConsensusInstance.prepare(n) / accept(n, val) use monotonic proposal numbers where a higher prepare preempts a lower one, causing accept with a stale number to return {accepted: False}
Messages are delivered to the application only when all prior slots are decided; nextslot advances through a contiguous run, and a gap (undecided slot N when N+1 is decided) blocks all later delivery
Every slot in Total Order Broadcast runs the complete two-phase Paxos protocol (Prepare→Promise→Accept→Accepted) from scratch; no state carries between slots to skip Phase 1
LinearizableRegister.read() broadcasts the read through consensus rather than reading locally, establishing a consistent point in the total order — necessary for linearizability without leader leases
LinearizableRegister wraps TOBCluster to provide write()/read()/compareandset(), demonstrating the DDIA Chapter 9 equivalence between total-order broadcast and linearizable storage
TOBNode has no leader election or lease mechanism; any node can propose for any undecided slot at any time, making it vulnerable to competing proposals
TOBNode crash/recovery uses state transfer from peers via force_decide() on each missed slot; there is no WAL, persistent log, or on-disk state — all durability depends on a majority staying alive
Each slot in the total order log is decided by an independent single-decree Paxos instance (ConsensusInstance); there is no stable-leader optimization or Multi-Paxos Phase 1 skip
When a proposer's value loses its slot to a competing value, the original is pushed back onto _pending for proposal in a later slot, ensuring no broadcast messages are silently dropped
Proposal numbers are computed as round * numnodes + nodeid, guaranteeing uniqueness across nodes within the same round but restarting at round 0 for each slot independently
The TOB cluster uses 3 nodes where 2 can make progress but 1 alone cannot, confirming a strict-majority quorum requirement for both Paxos Phase 1 and Phase 2
After recover_node(), the recovered node's delivery order matches live nodes' order, meaning recovery replays all consensus decisions made while the node was down via state transfer
Each slot in the total order broadcast is decided by an independent ConsensusInstance running single-decree Paxos with prepare/accept phases (totalorderbroadcast.py:3).
In handleprepare_response, if any acceptor reports a previously accepted value, the proposer must adopt the one with the highest proposal number — this is the core Paxos safety invariant preventing decided values from being overwritten
Tombstone management is handled differently at every layer: the LSM uses an ambiguous empty-bytes sentinel indistinguishable from empty values, compaction preserves tombstones by default with caller-controlled removal, and distributed deletion requires cross-replica convergence before safe removal — with no coordination between these concerns.
The tombstone/value distinction in sstable-and-compaction/sstable.py rests on a single byte (0xFF vs first byte of a 4-byte value length), making it vulnerable to single-bit corruption that silently converts live entries to deletions or vice versa
A tombstone for key K can only be safely removed during compaction if every SSTable that could contain an older entry for K is included in that compaction run; otherwise the deleted key can be resurrected from a surviving SSTable
A tombstone record in segment N removes the key from _index even if the key's live value was written in an earlier segment, because the index is a flat dict shared across all segment scans.
ConflictRecord.remotevalue reports None for tombstoned changes rather than the internal TOMBSTONE sentinel; callers see deletion as None and never observe the sentinel object.
Storing b"_BITCASKTOMBSTONE_" as a legitimate value in log-structured-hash-table causes silent data loss on next recovery, as scan_segment interprets it as a delete — the sentinel is an implicit API constraint with no validation
Replication topology affects only the duration of observable divergence, not the final convergence outcome: all topologies use identical deterministic LWW resolution, but ring topology creates O(N) rounds of divergence while all-to-all converges in O(1) — with no adaptive topology switching, the system cannot minimize the divergence window in response to cluster size or partition conditions.
Both topologies use the same deterministic (timestamp, node_id) comparison for LWW resolution; topology affects when and where conflicts are detected, not which value wins.
If a torn write corrupts the 4-byte length prefix in wal.py:readrecord, the reader interprets garbage as record_length, reads that many bytes (consuming valid subsequent records as data), then returns None on short read — no error is raised and no resync is attempted
Serializable snapshot isolation is achieved by composing two complementary invariant layers: MVCC provides visibility correctness (append-only versions, own-writes guarantee, symmetric deletion visibility) while SSI adds serializability enforcement (read-only optimization, committed-only snapshots, write-delete mutual exclusion).
Transaction isolation's carefully composed two-invariant-layer model (MVCC visibility plus SSI conflict detection) is fragile under restart: abort is a status-change-only operation that leaves written data on disk, and MVCC counters that must be monotonic have no persistence mechanism, so a crash both resurrects aborted writes and breaks the ordering invariant that determines visibility.
Transaction abort is a status-change-only operation with no disk rollback, and MVCC correctness depends on monotonic counters that have no persistence mechanism, meaning a crash simultaneously leaves aborted-transaction writes on disk and resets the ordering counters that determine visibility — both invariants fail together.
The transfer dict uses (arcstart, arcend) tuples as keys, so two vnodes producing identical arc boundaries would silently overwrite rather than accumulate.
testaddnodereturnstransfers only asserts transfer direction (A→B) and existence, not arc non-overlap or total size correctness — the code is correct but the test doesn't prove it.
Since neither the reference implementation nor PostgreSQL merges internal nodes on delete, tree height only grows (on root splits) and never shrinks, even under heavy deletion
Topic.truncate updates baseoffsets[partition] with += actual, a relative shift, because it only removes a contiguous prefix from the front of the partition log.
The non-atomic truncate() in wal.py can produce reordered or incomplete files that pass per-record CRC validation; chained checksums would detect the corruption at the chain break point
TumblingWindowAggregator aligns window boundaries to multiples of the window size starting from 0 — windows are [0, size), [size, 2*size), etc., not relative to the first event's timestamp.
TumblingWindowAggregator assigns windows by floor-dividing the timestamp by window size, producing aligned non-overlapping boundaries regardless of when events arrive
lsm.py (using sparseindexinterval=16 and bisect) and sstable.py (using block_size=64 and manual binary search) implement the same sparse-index-with-block-scan pattern with different defaults, naming, and file format maturity
The standalone wal.py uses a logical WAL (keyed PUT/DELETE operations with COMMIT markers) while btree.py uses a physical WAL (raw page images); they solve different problems and have different recovery semantics
Catch-up via snapshotandstream and rebuild via full CDC event replay must produce identical derived-system state, verified by comparing get_state() output from two independent derived systems
A derived system added with catch_up=True receives all historical CDC events to reach current state, without requiring a separate snapshot mechanism.
Every update and delete CDC event includes the previous value (oldvalue), enabling derived systems to undo prior state; inserts have oldvalue=None.
The unbundled database wires independent subsystems (WAL, storage engine, CDC, derived systems) through a shared append-only log with independent consumer positions, applying Unix-style composition at the system architecture level
Derived systems (secondary indexes, materialized views, full-text search) only see mutations after db.flush() is called; writes go to WAL and storage immediately but CDC consumers are decoupled.
The unbundled database writes to WAL before storage engine, while the CDC module writes to in-memory rows before appending the log — opposite ordering of the source-of-truth relationship
UnbundledDatabase.put() and delete() return CDCEvent objects directly, making CDC a synchronous, first-class part of the write API rather than a side-channel.
rebuildsystem() must produce state identical to incremental live processing — this is a tested invariant verified by capturing getstate() before and after rebuild.
The unbundled database WAL assigns LSNs starting from 1 (not 0), with latest_lsn == 0 indicating an empty log.
The unbundled database's WAL supports optional file-backed persistence via a .jsonl file path, and a new WriteAheadLog instance recovers entries from that file on init.
After db.flush(), get_lag() returns 0 for all derived systems, confirming all pending CDC events have been consumed
LSNs returned by the unbundled database's put()/delete() are 1-indexed and strictly sequential with no gaps
StorageEngine.rebuild(wal) clears all existing data (including manually injected entries) before replaying WAL entries, ensuring no phantom or stale state survives a rebuild
WAL entries in the unbundled database's _entries list are always in ascending LSN order, maintained by the sequential nature of append()
The unbundled database WAL cutoff is exclusive-below: entries with lsn >= cutoff are retained, entries with lsn < cutoff are discarded
WriteAheadLog.truncatebefore in the unbundled database removes entries from memory but does not modify the on-disk WAL file, so reloading from persistpath restores truncated entries
Truncation in the unbundled database WAL never resets nextlsn, so appending after truncation continues with monotonically increasing LSNs
UnfencedResourceServer has no token parameter on write() and accepts all writes unconditionally, serving as the pedagogical unsafe baseline for comparison with FencedResourceServer
Events whose eventtype has no entry in the handlers dict are skipped without warning in both reconstructstate and Projection.catch_up
If a segment contains a key with invalid UTF-8 bytes, scansegment raises an unhandled UnicodeDecodeError that aborts recovery for all subsequent segments.
B-tree values are written as raw bytes with no encoding; keys accept str (auto-encoded to UTF-8) or bytes, but values must already be bytes
Pruning permanently discards causal information; subsequent compare() or dominates() calls treat pruned nodes as counter=0, which can produce false concurrency or false dominance
VectorClock.prune(n) retains the n entries with the highest counter values and discards the rest, using sorted descending by counter
When multiple nodes share the same counter value during prune, which survive depends on Python's sorted stability over dict iteration order — not a deterministic tiebreak policy callers can rely on
VectorClock.compare returns one of four string values — BEFORE, AFTER, EQUAL, or CONCURRENT — implementing a partial order where the symmetric property holds (BEFORE ↔ AFTER)
VectorClock is immutable: increment, merge, and prune all return new instances and never modify _clock in place.
Verification of system correctness is impossible at every layer of the architecture: data integrity verification degrades from partial to absent along the storage pipeline (WAL CRCs exclude metadata, SSTables have no checksums at all), while protocol safety claims are unfalsifiable because all distributed testing uses synchronous simulation that cannot exercise the asynchronous failure modes the protocols are designed to tolerate.
The direction field in proof siblings is not validated; any value other than "left" is silently treated as "right" via the else branch.
verifyproof concatenates hex-encoded hash strings and encodes to bytes before hashing, matching the exact same scheme used in init_ to build internal nodes — a mismatch would silently break all proof verification.
verify_proof is a @staticmethod with no side effects; it requires no tree instance, only the data and a MerkleProof, so it can run on a different machine than the one that built the tree.
verify_proof confirms data matches a given root hash but does not authenticate the root itself; callers must obtain a trusted root through an independent channel (e.g., a signed block header).
put() determines the next version by scanning all replicas including those marked unavailable, which prevents version regression but couples version assignment to unavailable node state.
VersionedKVStore.get returns a list of VersionedValue entries; concurrent (causally unrelated) writes produce multiple siblings, and the store never auto-resolves conflicts — clients must call reconcile
The VersionedValue dataclass (dynamo.py:14-18) carries only value, version, and node_id with no field to distinguish a live value from a deletion marker, making correct distributed deletes impossible without schema changes
Receiving a VIEWCHANGE from another node causes a non-primary node to broadcast its own VIEWCHANGE if it hasn't already, propagating the vote through the cluster without requiring all nodes to independently detect the faulty primary.
The new primary only acts on a view change after collecting at least 2f+1 VIEW_CHANGE messages, matching the standard PBFT quorum requirement for liveness.
handleviewchange silently drops any message with msg.view <= self.currentview, enforcing monotonically non-decreasing view progression across all nodes.
isvisible is a pure function with no side effects; it reads committed, aborted, tx.activeatstart, and version fields but never mutates any state.
A version is visible to a transaction only if its creator committed, was not in the reader's activeatstart set, and has a lower tx_id — all three must hold
With 3 equally-weighted nodes and 150 vnodes each, loadimbalance() is asserted to stay below 1.5 (testconsistent_hashing.py:49).
Every WAL method that writes records or modifies files (append, appendbatch, checkpoint, truncate) acquires self.lock before performing any I/O
openlatest opens WAL files in "ab" (append-binary) mode, making it impossible for post-crash writes to overwrite pre-crash records; this is the bridge between recovery and forward progress
append performs no validation of optype against OPBYTES; invalid operation names raise KeyError from the dictionary lookup, not a descriptive error.
Individual append calls are not wrapped in any transaction boundary; for atomic multi-operation writes, append_batch must be used instead.
appendbatch of N items writes N+1 records to the WAL: the N data records plus a trailing record with optype == "COMMIT" that seals the batch
append_batch() always force-fsyncs regardless of the configured sync mode, because batch atomicity requires the COMMIT marker to be durable before returning.
A batch write is only considered complete if its trailing COMMIT record is present and passes CRC; incomplete batches (missing commit) are discarded during replay
The stale writecount after forced syncs causes extra fsyncs (premature batch threshold triggers) but never causes data loss — forced syncs always flush to disk regardless of counter state.
Every append_batch call writes exactly one COMMIT record (op code 3) as the final record in the batch, with empty key and value fields, consuming one additional sequence number.
In WAL batch sync mode, up to batchsynccount - 1 records (default 99) can be lost on kill -9 because os.fsync() is deferred until the batch threshold is reached; only append_batch() force-fsyncs regardless of mode.
append_batch() buffers all operations plus COMMIT into a single write() call to minimize the partial-write window, but the code acknowledges this is not a true atomicity guarantee — a crash during the write can persist a prefix without the COMMIT marker
appendbatch writes the entire batch buffer in one fd.write call before mayberotate, so a batch never spans two WAL files under normal operation
append_batch() serializes all data records plus the trailing COMMIT marker into one bytearray and issues a single fd.write() call, relying on OS write atomicity for small batches
In batch mode, up to batchsynccount - 1 records may be lost on crash because they haven't been fsynced yet.
The WAL's dosync batch mode (fsync every N writes) is overridden by force=True in append_batch and checkpoint, ensuring atomic batch boundaries and checkpoint records are always fsynced regardless of configured sync mode.
The WAL binary format is simultaneously inflexible and fragile: contiguous record packing with no block alignment prevents resync after mid-file corruption, signed 32-bit length fields theoretically admit negative values with no guard, and there is no version field — a format change invalidates all existing WAL files with no migration path and no way to distinguish old-format from new-format records.
checkpoint() increments the WAL sequence counter by 1, occupying a position in the sequence number space alongside data records (e.g., after 6 data records at seq 1-6, currentseqnum() is 7 and checkpoint() returns 8).
checkpoint() calls dosync(force=True), making the checkpoint marker durable on disk regardless of the configured sync mode, so recovery boundaries are never lost even in "none" or "batch" mode.
checkpoint() calls dosync(force=True), making checkpoint markers always durable on disk regardless of the configured sync mode
WriteAheadLog.checkpoint() returns the sequence number it wrote, providing the caller the exact truncation boundary for the WAL-to-data-store coordination protocol
Checkpoint markers consume a sequence number in the same monotonic space as data records — a checkpoint after 7 data records occupies seq=8
WAL.commit clears the entire WAL unconditionally via truncate; there is no partial commit or transaction grouping within a single WAL file
WAL.commit always fsyncs the data file before truncating the WAL file; reversing this order would create a crash-safety hole where committed data could be lost
WAL.commit() in the B-tree engine syncs metadata page 0 because PageManager.sync() fsyncs the single shared file descriptor that holds both data pages and metadata; there is no dedicated metadata fsync step
A concurrent append() writing to a file the iterator is reading can produce a partial record that readrecord interprets as CRC failure, terminating the entire iterator and dropping all remaining valid records across subsequent files
None of the WAL implementations use block-aligned or page-aligned record layouts; all records are packed contiguously with variable-length encoding, so torn writes can land at arbitrary byte offsets
When the WAL encounters a corrupted record during replay, it stops and returns all structurally valid records preceding the corruption point rather than raising an exception to the caller.
recoverseq_num stops at the first corrupted record within each file but continues to subsequent WAL files, so cross-file recovery is resilient but intra-file records after corruption are silently lost
The CRC32 checksum covers only op_type + key + value; a corrupted sequence number would not be detected by integrity checking.
The standalone WAL uniquely includes optypebyte in its CRC input alongside key and value, protecting against silent operation-type corruption (e.g., PUT flipped to DELETE) that the other CRC implementations leave undetected
readall_records catches CRC ValueError and stops iteration entirely (returns, not continues), so corruption in any file aborts reading of all subsequent files too.
WAL and hash-index Bitcask both default maxfilesize to 10 MB, while the log-structured Bitcask defaults to 1 MB — reflecting the latter's heavier reliance on compaction with hint files to offset recovery cost.
WriteAheadLog._init defaults syncmode to "sync", making per-write fsync the safe default that callers must explicitly opt out of for higher throughput
The dosync method's if/elif structure means forced syncs in batch mode skip all counter logic entirely — the counter is neither incremented nor reset, which is the structural root cause of the premature batch-sync bug.
dosync does not acquire self._lock; thread safety depends on every caller holding the lock before invoking it.
The replay() docstring claims it "skips uncommitted batches" (describing the DDIA concept), but the inline comments and implementation show it returns all PUT/DELETE records regardless of COMMIT presence; the docstring is aspirational, the comments are accurate
The WAL's carefully calibrated write-time durability tiers (per-write fsync in sync mode vs. batch-only fsync) are completely invisible to recovery: sequence numbers and checkpoints that could distinguish between definitely-durable and potentially-lost records are never consulted, so replay treats all CRC-valid records identically regardless of whether they were fsynced — the write-time investment in durability is wasted.
encoderecord is a pure function that performs no I/O or state mutation; all disk writes (and fsync) are the responsibility of callers (append, append_batch, checkpoint, truncate).
An EOF or partial read within a single WAL file causes readall_records to advance to the next file rather than terminate the entire iterator
A WAL file whose size equals maxfilesize triggers rotation in open_latest; only files strictly smaller are reused.
WAL file size management avoids filesystem syscall overhead through tell()-based tracking with soft-limit semantics: the hot path uses fd.tell() instead of os.path.getsize() to avoid stat calls and TOCTOU races, the size limit is a soft cap checked before the next write rather than mid-write, and files at exactly the limit trigger rotation to prevent unbounded growth while allowing single-record overshoot.
walfiles() returns segment files sorted by filename ({N:06d}.wal), which equals chronological order since segment numbers increment monotonically
If a crash occurs between SSTable flush and WAL truncation, replay re-inserts already-persisted entries into the memtable, which is safe because memtable values shadow SSTable values during reads (last-writer-wins)
Changing the CRC input in encoderecord invalidates all existing WAL files; a production deployment would require a version byte and dual-path CRC verification during migration
Each log_write call forces an os.fsync, guaranteeing per-entry durability at the cost of one sync syscall per logged page write.
WALRecord stores only key and value (the new value) with no old_value field, making undo structurally impossible from the log alone
WriteAheadLog in wal.py:80 uses a threading.Lock to serialize append, append_batch, checkpoint, and truncate, making it the only concurrency-aware component in the codebase
The WAL module provides explicit truncate(uptoseq) and file rotation for log lifecycle management; the Raft module has no equivalent despite both being append-only log abstractions
mayberotate uses self._fd.tell() instead of os.path.getsize(), avoiding filesystem stat calls and TOCTOU races on every append
The B-tree WAL uses redo-only recovery (replay logged page images forward); there is no undo log, so incomplete pre-commit writes in the data file are overwritten by WAL replay to restore consistency
The unbundled database's StorageEngine is fully derivable from the WriteAheadLog; calling rebuild() replays the entire WAL and reproduces identical state, making the log the authoritative record
iterate() yields every record including COMMIT and CHECKPOINT markers, providing the raw stream needed for commit-aware recovery logic built on top of replay()
Key and value lengths in the WAL binary format are packed as signed 32-bit integers (struct format <i), limiting each field to ~2 GB and leaving negative lengths unguarded.
The WAL binary record format uses little-endian byte order unconditionally (struct prefix <), making log files non-portable across architectures with different endianness.
maxfilesize is a soft limit; a single batch can push a WAL file arbitrarily past it because mayberotate runs only after the write and sync complete
maxfile_size is a soft cap: the check triggers rotation for the *next* write, it does not prevent the current write from exceeding the limit.
The WAL module in write-ahead-log/wal.py uses self._lock for thread safety across its mutation paths, while the LSM tree module has zero locking or concurrency control despite having the same concurrent-access risks
The WAL protocol has no BEGIN record type, making it impossible during replay to distinguish a standalone PUT from the first record of a multi-record batch — the structural root cause of why COMMIT markers cannot enforce atomicity.
The standalone WriteAheadLog class in write-ahead-log/wal.py has no commit method; sync-then-truncate coordination must be implemented by a higher-level storage engine that composes the WAL with a data file
The WAL uses a single threading.Lock and a write counter for batch sync rather than a concurrent waiter queue, so it cannot amortize fsync cost across concurrent callers the way PostgreSQL's group commit does.
The WAL implementation never fsyncs the parent directory after segment creation or deletion, meaning file metadata changes (including unlinks) are not guaranteed durable on crash on Linux
Neither the WAL reader nor any visible consumer checks for sequence number gaps after replay, making silent data loss from mid-file corruption undetectable
The WAL format has no batch-start marker and no nesting support; the lock held during append_batch prevents interleaving, making nested or concurrent batches structurally impossible
A corrupted recordlength in the WAL causes read_record to misframe all subsequent records; there is no magic-byte or scan-forward recovery mechanism to re-synchronize
If write() or fsync() fails mid-batch in append_batch, sequence numbers are already incremented with no rollback mechanism, creating a permanent gap in the sequence space.
The WAL module has no tests for truncated records, CRC mismatches, or partial writes; the B-tree module tests CRC corruption explicitly but the WAL does not
No production module imports the standalone WriteAheadLog; it is only referenced in testwal.py and testertest_wal.py, meaning crash-safe commit coordination using this module is unimplemented
openlatest is called exclusively from _init, so the TOCTOU window between wal_files() and os.path.getsize() cannot be hit by concurrent WAL operations
readrecord returns None on partial/short reads (torn writes), treated as EOF; this is distinct from CRC mismatch which raises ValueError — the distinction separates "crash during write" from "data corruption."
If the standalone WAL's truncate() crashes mid-iteration over files, un-processed files retain old records that will be replayed on recovery; this is safe because PUT and DELETE are idempotent against an already-current store
The WAL protects user key-value writes but does not record SSTable-level state transitions (flush, compaction, level assignment), leaving metadata changes unprotected across crashes
The WAL syncs with os.fsync for durability guarantees only; it has no role in coordinating concurrent access, consistent with the single-threaded design
readrecord is a module-level function with no dependency on WriteAheadLog instance state, enabling its use during WAL construction/recovery before the object is fully initialized.
Each WAL record on disk is prefixed with a 4-byte little-endian uint32 length covering everything after itself, allowing the reader to atomically skip or validate entire records.
The recordlength field counts 21 + len(key) + len(value) bytes — everything after the 4-byte length prefix itself — so total on-disk size per record is recordlength + 4.
Every WAL record is prefixed with a 4-byte little-endian length covering all subsequent fields (CRC through value), enabling the reader to know exactly how many bytes to consume
BTree._init_ calls WAL.recover() unconditionally on every startup; an empty WAL short-circuits after a single read with no further I/O
recoverseqnum uses break on corruption (continues to next file) while readallrecords uses return (stops everything) — recovery is lenient to maximize the sequence counter, replay is strict to avoid replaying past corruption
WAL recovery fsyncs the data file via page_manager.sync() before truncating the WAL, so a crash during recovery itself leaves the WAL intact for re-replay
The WAL's recovery model guarantees a valid-prefix contract: both corruption (CRC mismatch halts replay returning prior valid records) and partial writes (treated as EOF, not error) cause replay to stop cleanly, always returning a consistent prefix of the written log.
WAL segment discovery during recovery uses os.listdir(), so any segment whose parent directory entry was not fsynced becomes invisible after a crash — connecting the dir-fsync gap to concrete data loss.
The WAL carries vestigial recovery-related infrastructure: sequence numbers are monotonically assigned under lock and stored in every record but never consulted for ordering, deduplication, or gap detection during recovery (file position alone determines replay order), and checkpoint records occupy positions in the same sequence space but are neither searched for during recovery scanning nor used as truncation watermarks.
recoverseq_num in write-ahead-log/wal.py reads every record in every WAL file sequentially from byte zero; there is no seek-to-end or block-skip optimization, making recovery O(file-size)
replay() returns all PUT/DELETE records regardless of whether they are followed by a COMMIT record, meaning uncommitted batches are replayed identically to committed ones.
replay() does not verify that batch operations have a corresponding COMMIT record; partial batches from a mid-write crash would be replayed as if committed, since replay filters only by record type, not by batch completeness
_rotate always fsyncs the outgoing segment before closing it, ensuring no buffered writes are lost during segment rotation.
If a segment file is manually deleted from the directory, _rotate can produce a filename that reuses a previously-used number, since it derives the next name solely from the current highest filename.
Both append and appendbatch call maybe_rotate() after write and sync complete; rotation never interrupts an in-progress write, and the file only rotates on the next operation
test_rotation asserts record count after rotation but does not verify that sequence numbers are monotonically increasing across file boundaries — the cross-file monotonicity invariant holds by construction but has no regression test.
WAL segment filenames are zero-padded 6-digit integers (000001.wal, 000002.wal, ...) derived from the highest existing filename plus one, which makes lexicographic sort equal numeric sort.
seqnum is a single in-memory counter that only increments via += 1; all records across all WAL files share one monotonically increasing sequence space that never resets, even across file rotation.
WAL seq_num increases monotonically across all records but is not stamped onto data pages, so there is no mechanism to detect whether a replayed operation was already applied — making replay non-idempotent against the underlying store.
On construction, recoverseq_num() scans all WAL files to find the maximum surviving sequence number, guaranteeing the monotonic counter never goes backward across crashes
WAL sequence numbers are monotonic and strictly increasing but entirely vestigial: they are computed under lock, stored in every record, and parsed during recovery, yet never consulted for ordering, deduplication, or gap detection — file position alone determines replay order.
Sequence numbers increment under a threading.Lock and are never reused; on recovery, all WAL files are scanned to find the high-water mark so new records continue the sequence.
The WAL sequence counter resets to 0 on every commit; sequence numbers are only meaningful within a single uncommitted transaction window
Sequence numbers begin at 1 for the first appended record; an empty WAL reports currentseqnum() == 0
The 4-byte sequence number field in each WAL entry is parsed during recover() but never consulted; replay order is determined solely by file offset
At most one file descriptor is open for WAL writes at any time; _rotate closes the old fd before opening the new one.
The WAL enforces single-writer via threading.Lock but has no inter-process locking mechanism (no flock/PID file), so the single-writer invariant holds only within a single OS process
openlatest checks on-disk file size via os.path.getsize, not an in-memory counter, making it correct across crash/restart boundaries.
WriteAheadLog defaults to sync_mode="sync", calling flush() + os.fsync() after every single append call — the safest and slowest mode.
With sync_mode="sync", WAL records are durable immediately after append returns — they survive process death even if close() is never called.
WriteAheadLog accepts any string as sync_mode without validation; values other than "sync" and "batch" silently disable all fsync.
The WAL carefully serializes all mutations (append, appendbatch, checkpoint, truncate) under a threading.Lock, but both callers that depend on it have no synchronization for their own shared state: the LSM tree mutates memtable, sstables, and immutablememtables without any locking or atomic swaps, and sstables specifically is mutated by both flush (append) and compact (full replacement) concurrently — the lock protects the log but not the data structures built from it.
truncate holds self._lock for its entire duration (close, rewrite, reopen), blocking all concurrent appends, replays, and iterates
truncate() flushes, fsyncs, and closes the current write file descriptor before iterating segments, preventing conflicts with files it may need to delete or rewrite
WAL files where every record has seqnum <= upto_seq are deleted entirely via os.remove rather than left as empty files on disk
truncate() iterates segments via walfiles() in oldest-first order, so a crash mid-truncation leaves a contiguous suffix of segments — preserving the recovery invariant that surviving files form a continuous sequence
During truncate, if a corrupt record is encountered in a WAL file, all subsequent records in that file are silently discarded regardless of their sequence number — even records with seqnum > upto_seq
truncate() flushes and fsyncs the current WAL file before scanning for records to remove, preventing data loss from buffered writes that haven't reached disk
truncate(n) removes records with seq_num <= n (inclusive upper bound), keeping only records strictly greater than n
truncate(seq) removes all records with sequence number less than or equal to seq, keeping only records where seq_num > seq
truncate does not validate that uptoseq is within range — passing a value beyond currentseqnum() silently deletes all records without error
truncate() rewrites WAL files in place without atomic rename, so a crash during truncation can leave the log in an inconsistent state.
WriteAheadLog.truncate(uptoseq) keeps records with seqnum > upto_seq and deletes only those at or below, enabling partial log reclamation tied to checkpoint boundaries.
truncate(uptoseq) takes an explicit sequence number parameter; the WAL never decides what to truncate on its own — the caller must provide the boundary
WAL truncation is only safe because it is called after the data the WAL protects has been durably written to the main store (SSTable flush or data file fsync); violating this ordering loses committed data with no recovery path
WriteAheadLog.truncate() reads and rewrites every segment file record-by-record rather than deleting whole segment files, making it O(total records) instead of O(segments)
WAL truncation combines three independent failure modes: it blocks all concurrent operations for its entire duration, silently discards all records after encountering corruption in a file, and is not crash-safe due to in-place file rewriting without atomic rename.
readrecord returns None for short reads (truncation) but raises ValueError for CRC mismatch (corruption), giving callers two distinct failure modes to handle differently during recovery
The WAL provides two distinct durability levels: individual appends respect the configured sync mode (potentially skipping fsync entirely), while batch operations always force fsync, meaning batch writes are strictly more durable than individual writes regardless of configuration.
WAL integrity uses zlib.crc32 (32-bit, non-cryptographic); it detects accidental corruption but not intentional tampering
PageManager.writemeta calls flush() but not os.fsync(), so metadata updates (nextfreepage, freelisthead) are not durable against power loss — compounding the WAL bypass with a durability gap on the direct write path.
PageManager.writepage truncates data exceeding pagesize to exactly pagesize bytes without raising an error, which would corrupt the node header's numkeys count
The write-skew-detection module's tests exist only in testertestssi.py, making it entirely untested under default pytest invocation since that filename does not match the test*.py / *test.py globs
The WAL's carefully engineered write-time durability infrastructure is systematically abandoned at recovery time: the two-tier durability model (per-write sync vs. batch-only fsync) loses all distinction during replay because recovery ignores tiers entirely, AND crash recovery is both broken (no safe path across any implementation) and unverified (no crash or async tests), meaning the engineering investment in write-time safety provides zero value when it is most needed.
PREPARE and COMMIT messages with non-matching digests are silently dropped and never count toward quorum thresholds; a Byzantine node sending bad digests cannot contribute to agreement
WRONGDIGEST mode produces "baddigest{nodeid}", so two Byzantine nodes with this mode produce different invalid digests rather than accidentally colluding on the same forged value
VectorClock._init_ strips all entries with value 0, ensuring two clocks with identical non-zero entries are always equal and hash-equal.
If keys and values lists passed to serializeleaf have mismatched lengths, zip silently truncates to the shorter list while num_keys in the header reflects the longer, producing a corrupt page
All 13 CRC call sites across the codebase use zlib.crc32, which implements the ISO 3309 polynomial (0xEDB88320); no module uses the Castagnoli polynomial (CRC-32C) that RocksDB and PostgreSQL prefer for hardware-accelerated checksumming.
Python's zlib.crc32(data, initial_value) accepts an initial CRC value, meaning chained checksums could be implemented in this codebase by passing the previous frame's CRC as the seed with no additional hashing infrastructure