fix(memory-pool): replace flaky sum-based assertion with monotonic counter check by phil-opp · Pull Request #2267 · dora-rs/dora

phil-opp · 2026-06-18T03:14:30Z

Summary

Addresses the two concrete findings in #2264.

Finding 2 fix: eliminate the ~3% spurious failure rate in `receiver.py`

The old assertion summed the first 8 int64 elements (each in [0, 1000)) of two consecutive random tensors and asserted they differed. With a sum range of [0, 7992] and ~99 comparisons per run, two adjacent draws had a ~1/3000 collision probability, giving roughly 3% spurious test failure per run even when the transport worked perfectly.

Fix: stamp element [0] with the iteration counter i in the sender, and assert tensor[0] == i in the receiver. Zero collision probability.

sender.py:

random_data[0] = i  # monotonic counter lets receiver detect change without collision risk

receiver.py:

if SCENARIO != "write_after_free":
    actual = int(torch_tensor[0].item())
    assert actual == i, f"iteration {i}: tensor[0] expected {i}, got {actual}"

Finding 1 (partial): wire `cpu2cpu.yml` into the test harness

cpu2cpu.yml is the only GPU-free scenario in the memory-pool suite. Added:

tests/example-smoke.rs: two #[ignore] smoke tests (torch is not installed in standard PR CI; run explicitly with cargo test --test example-smoke -- --ignored smoke_memory_pool)
scripts/smoke-all.sh: conditional block that runs cpu2cpu.yml when python3 -c "import torch, tqdm" succeeds, skips otherwise
Coverage audit table updated

The CUDA-dependent scenarios (cpu2cuda, cuda2cpu) remain blocked by the need for NVIDIA hardware.

Closes #2264

Generated by Claude Code

…ssertion The receiver validated data propagation by summing the first 8 elements of a random tensor (range [0,1000) each). Adjacent draws had ~1/3000 collision probability; across 99 comparisons this gave ~3% spurious failure per run even when the transport worked correctly. Stamp element[0] with the iteration counter in the sender so the receiver can assert `tensor[0] == i` — deterministic with zero false positives. Also wire cpu2cpu.yml into the test harness (the only GPU-free scenario): - tests/example-smoke.rs: two #[ignore] tests (torch not in PR CI, run explicitly with `cargo test -- --ignored smoke_memory_pool`) - scripts/smoke-all.sh: conditional on `python3 -c "import torch"` Partially addresses #2264 (smoke tests for cpu2cpu; CUDA variants still require NVIDIA hardware). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01QheETdNcNiZ6JZa8oz5sRK

trunk-io · 2026-06-18T03:14:33Z

Merging to main in this repository is managed by Trunk.

To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

phil-opp · 2026-06-18T06:34:44Z

Automated review by Claude — fully automated, no human has vetted this; treat as a suggestion.

No issues found. Reviewed the diff directly:

The sender stamps random_data[0] = i (int64, no overflow for the message range) and the receiver asserts tensor[0] == i, which is deterministic and eliminates the collision risk of the old sum-of-8 check. The assertion is correctly skipped for the write_after_free scenario where reading the tensor is intentionally unsafe.
Checking on every iteration (including i == 0) is strictly better than the old prev_sum is not None skip of the first comparison.
The smoke-test wiring is gated correctly: #[ignore] on the two Rust tests and a python3 -c "import torch, tqdm" guard in smoke-all.sh, so standard PR CI (no torch) is unaffected.

Test-only / example-only change with no production-code impact.

Generated by Claude Code

phil-opp · 2026-06-19T06:38:37Z

Automated review by Claude — fully automated review; no human has vetted this, please verify before acting.

No issues found. The counter fix is correct, the two smoke tests are #[ignore]-gated, and scripts/smoke-all.sh gates the run on python3 -c "import torch, tqdm" while only invoking the CPU-only cpu2cpu.yml (both nodes *_device: cpu), so CPU CI is safe.

One note: this overlaps with #2268 and #2279, which address the same #2264 findings. #2268 additionally wires the four *_device: cuda negative scenarios under the torch-only gate (a problem already flagged there); this PR avoids that by staying CPU-only. Worth consolidating to a single PR.

Generated by Claude Code

This was referenced Jun 18, 2026

fix(memory-pool): add CI smoke-test coverage and fix flaky assertion #2268

Open

fix(memory-pool): deterministic transport validation and smoke test coverage #2279

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(memory-pool): replace flaky sum-based assertion with monotonic counter check#2267

fix(memory-pool): replace flaky sum-based assertion with monotonic counter check#2267
phil-opp wants to merge 1 commit into
mainfrom
claude/inspiring-hypatia-j8ktrl-memory-pool-tests

phil-opp commented Jun 18, 2026

Uh oh!

trunk-io Bot commented Jun 18, 2026

Uh oh!

phil-opp commented Jun 18, 2026

Uh oh!

phil-opp commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

phil-opp commented Jun 18, 2026

Summary

Finding 2 fix: eliminate the ~3% spurious failure rate in receiver.py

Finding 1 (partial): wire cpu2cpu.yml into the test harness

Uh oh!

trunk-io Bot commented Jun 18, 2026

Uh oh!

phil-opp commented Jun 18, 2026

Uh oh!

phil-opp commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Finding 2 fix: eliminate the ~3% spurious failure rate in `receiver.py`

Finding 1 (partial): wire `cpu2cpu.yml` into the test harness