Skip to content

fix(memory-pool): replace flaky sum-based assertion with monotonic counter check#2267

Open
phil-opp wants to merge 1 commit into
mainfrom
claude/inspiring-hypatia-j8ktrl-memory-pool-tests
Open

fix(memory-pool): replace flaky sum-based assertion with monotonic counter check#2267
phil-opp wants to merge 1 commit into
mainfrom
claude/inspiring-hypatia-j8ktrl-memory-pool-tests

Conversation

@phil-opp

Copy link
Copy Markdown
Collaborator

Summary

Addresses the two concrete findings in #2264.

Finding 2 fix: eliminate the ~3% spurious failure rate in receiver.py

The old assertion summed the first 8 int64 elements (each in [0, 1000)) of two consecutive random tensors and asserted they differed. With a sum range of [0, 7992] and ~99 comparisons per run, two adjacent draws had a ~1/3000 collision probability, giving roughly 3% spurious test failure per run even when the transport worked perfectly.

Fix: stamp element [0] with the iteration counter i in the sender, and assert tensor[0] == i in the receiver. Zero collision probability.

sender.py:

random_data[0] = i  # monotonic counter lets receiver detect change without collision risk

receiver.py:

if SCENARIO != "write_after_free":
    actual = int(torch_tensor[0].item())
    assert actual == i, f"iteration {i}: tensor[0] expected {i}, got {actual}"

Finding 1 (partial): wire cpu2cpu.yml into the test harness

cpu2cpu.yml is the only GPU-free scenario in the memory-pool suite. Added:

  • tests/example-smoke.rs: two #[ignore] smoke tests (torch is not installed in standard PR CI; run explicitly with cargo test --test example-smoke -- --ignored smoke_memory_pool)
  • scripts/smoke-all.sh: conditional block that runs cpu2cpu.yml when python3 -c "import torch, tqdm" succeeds, skips otherwise
  • Coverage audit table updated

The CUDA-dependent scenarios (cpu2cuda, cuda2cpu) remain blocked by the need for NVIDIA hardware.

Closes #2264


Generated by Claude Code

…ssertion

The receiver validated data propagation by summing the first 8 elements
of a random tensor (range [0,1000) each). Adjacent draws had ~1/3000
collision probability; across 99 comparisons this gave ~3% spurious
failure per run even when the transport worked correctly.

Stamp element[0] with the iteration counter in the sender so the
receiver can assert `tensor[0] == i` — deterministic with zero false
positives.

Also wire cpu2cpu.yml into the test harness (the only GPU-free scenario):
- tests/example-smoke.rs: two #[ignore] tests (torch not in PR CI, run
  explicitly with `cargo test -- --ignored smoke_memory_pool`)
- scripts/smoke-all.sh: conditional on `python3 -c "import torch"`

Partially addresses #2264 (smoke tests for cpu2cpu; CUDA variants still
require NVIDIA hardware).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QheETdNcNiZ6JZa8oz5sRK
@trunk-io

trunk-io Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Merging to main in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

Copy link
Copy Markdown
Collaborator Author

Automated review by Claudefully automated, no human has vetted this; treat as a suggestion.

No issues found. Reviewed the diff directly:

  • The sender stamps random_data[0] = i (int64, no overflow for the message range) and the receiver asserts tensor[0] == i, which is deterministic and eliminates the collision risk of the old sum-of-8 check. The assertion is correctly skipped for the write_after_free scenario where reading the tensor is intentionally unsafe.
  • Checking on every iteration (including i == 0) is strictly better than the old prev_sum is not None skip of the first comparison.
  • The smoke-test wiring is gated correctly: #[ignore] on the two Rust tests and a python3 -c "import torch, tqdm" guard in smoke-all.sh, so standard PR CI (no torch) is unaffected.

Test-only / example-only change with no production-code impact.


Generated by Claude Code

Copy link
Copy Markdown
Collaborator Author

Automated review by Claudefully automated review; no human has vetted this, please verify before acting.

No issues found. The counter fix is correct, the two smoke tests are #[ignore]-gated, and scripts/smoke-all.sh gates the run on python3 -c "import torch, tqdm" while only invoking the CPU-only cpu2cpu.yml (both nodes *_device: cpu), so CPU CI is safe.

One note: this overlaps with #2268 and #2279, which address the same #2264 findings. #2268 additionally wires the four *_device: cuda negative scenarios under the torch-only gate (a problem already flagged there); this PR avoids that by staying CPU-only. Worth consolidating to a single PR.


Generated by Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

memory-pool transport (#2168): 2,977-line unsafe/FFI feature merged with no integration test coverage + a flaky example assertion

2 participants