fix(memory-pool): add CI smoke-test coverage and fix flaky assertion#2268

Open
tang-canran wants to merge 4 commits into
dora-rs:mainfrom
tang-canran:fix/issue-2264

Conversation

@tang-canran

Contributor

Addresses automated review findings from issue #2264 on merged PR #2168.

Finding 1 — zero CI coverage for unsafe transport paths:

  • Add 6 smoke tests to tests/example-smoke.rs (all #[ignore] gated on torch + tqdm): cpu2cpu (networked + local), auto_cleanup, duplicate_free, read_after_free, write_after_free
  • Wire memory-pool into scripts/smoke-all.sh with proper torch gating
  • Mark memory-pool as 'covered' in the example coverage table

Finding 2 — probabilistically-flaky assertion in receiver.py:

  • Replace checksum-of-first-8-int64 change detection with a deterministic counter: sender stamps random_data[0] = i, receiver asserts torch_tensor[0] == i
  • Eliminates ~1/3000 per-comparison collision risk
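
To illustrate why the counter is deterministic where a sum-based check is not, here is a minimal sketch in plain Python. It uses a list as a stand-in for the shared tensor; `sum_changed` and `run_handshake` are hypothetical names for illustration and are not the code in sender.py or receiver.py.

```python
def sum_changed(prev, curr, n=8):
    """Checksum-style change detection: compare the sum of the first n values."""
    return sum(prev[:n]) != sum(curr[:n])


def run_handshake(iterations):
    """Simulate a turn-based exchange with a counter stamped into slot 0."""
    shared = [0] * 8  # stand-in for the shared buffer (assumption)
    seen = []
    for i in range(iterations):
        shared[0] = i          # sender stamps the iteration counter
        assert shared[0] == i  # receiver's deterministic check
        seen.append(shared[0])
    return seen


# Two different payloads whose first eight values share the same sum:
old = [1, 2, 3, 4, 5, 6, 7, 8]
new = [8, 7, 6, 5, 4, 3, 2, 1]
print(sum_changed(old, new))  # False: the change goes undetected
print(run_handshake(3))       # [0, 1, 2]
```

The sum check can miss a real change whenever two payloads happen to share a sum, while the stamped counter differs on every iteration by construction.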

Closes #2264

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@trunk-io

trunk-io Bot commented Jun 18, 2026

Contributor

Merging to main in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here.

@tang-canran

Contributor Author

@heyong4725 PR ready for review. This addresses both findings from #2264:

Finding 1 — added 6 smoke tests (cpu2cpu networked + local, plus 4 negative-lifecycle scenarios for the "warn, don't crash" contract), all #[ignore] gated on torch+tqdm
availability.

Finding 2 — replaced the checksum-of-first-8 assertion with a deterministic counter (sender stamps random_data[0] = i, receiver asserts tensor[0] == i), eliminating the ~1/3000
collision risk.

@tang-canran

Contributor Author

cargo test --test example-smoke memory_pool -- --ignored --test-threads=1

running 6 tests
test smoke_local_memory_pool_auto_cleanup ... ok
test smoke_local_memory_pool_cpu2cpu ... ok
test smoke_local_memory_pool_duplicate_free ... ok
test smoke_local_memory_pool_read_after_free ... ok
test smoke_local_memory_pool_write_after_free ... ok
test smoke_memory_pool_cpu2cpu ... ok

test result: ok. 6 passed; 0 failed; 0 ignored; 0 measured; 64 filtered out

And the 7 example dataflows:

| Scenario | Result |
| --- | --- |
| cpu2cpu.yml | ~6979 MB/s ✅ |
| cpu2cuda.yml | ~5648 MB/s ✅ |
| cuda2cpu.yml | ~8131 MB/s ✅ |
| duplicate_free.yml | expected warning, no crash ✅ |
| read_after_free.yml | expected warning, no crash ✅ |
| write_after_free.yml | expected warning, no crash ✅ |
| auto_cleanup.yml | cleanup logs, no crash ✅ |

Collaborator

Automated review by Claude: this is a fully automated review (Claude Code); no human has vetted it. Treat it as a suggestion, not an approval.

Reviewed the diff directly, not the description.

The Finding-2 change is sound: the sender.py counter stamp (random_data[0] = i) plus the receiver.py tensor[0] == i assertion is deterministic and removes the sum-collision risk. The assertion is correctly skipped for write_after_free, and the negative-lifecycle scenarios free only on the final iteration after the per-iteration check, so the == i invariant holds throughout. The cpu2cpu wiring is gated fine. No issues there.

I found one important issue with the four new negative-lifecycle tests (auto_cleanup, duplicate_free, read_after_free, write_after_free).

All four of those dataflow files hardcode receiver_device: cuda:

# examples/memory-pool/{auto_cleanup,duplicate_free,read_after_free,write_after_free}.yml
env:
  receiver_device: cuda

and receiver.py bails immediately when CUDA is absent:

if RECEIVER_DEVICE.startswith("cuda") and not torch.cuda.is_available():
    raise RuntimeError("CUDA is not available for the configured receiver device.")

So these four scenarios require an actual NVIDIA GPU — but this PR gates them only on python3 -c "import torch, tqdm" (in scripts/smoke-all.sh) and labels the Rust tests #[ignore = "requires torch and tqdm"]. On a GPU-less machine that does have torch installed (the exact case the gate is meant to admit), smoke-all.sh will run all four and the receiver node will crash with RuntimeError: CUDA is not available, failing the suite. The "all 6 passed" run in the PR description only passes because it was run on a CUDA box.

This also makes the coverage-table entry misleading: these four belong with the same "needs NVIDIA CUDA" blocker as cuda2cpu/cpu2cuda, not under a torch-only gate.

Only cpu2cpu.yml is genuinely GPU-free (receiver_device: cpu). I'd either (a) drop the four CUDA scenarios from the torch-only wiring and keep just cpu2cpu (which is exactly what #2267 does), or (b) gate them on torch.cuda.is_available() rather than import torch, tqdm.
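
Option (b) could be sketched roughly as follows. This is an illustration under assumptions, not the gate used in this PR: `torch_installed`, `cuda_available`, and `should_run` are hypothetical helper names, and the decision logic simply mirrors the `RECEIVER_DEVICE.startswith("cuda")` guard quoted above.

```python
import importlib.util


def torch_installed():
    """True if a torch package is importable, without importing it."""
    return importlib.util.find_spec("torch") is not None


def cuda_available():
    """True only if torch imports and reports a usable CUDA runtime."""
    if not torch_installed():
        return False
    import torch  # imported lazily so the probe also works without torch
    return torch.cuda.is_available()


def should_run(receiver_device, has_torch, has_cuda):
    """Decide whether a scenario can run, given its configured receiver device."""
    if not has_torch:
        return False
    if receiver_device.startswith("cuda"):
        return has_cuda
    return True


# A torch-but-no-CUDA machine should run CPU scenarios and skip CUDA ones.
print(should_run("cpu", True, False))   # True
print(should_run("cuda", True, False))  # False
```

A real gate would call `should_run(device, torch_installed(), cuda_available())` once per scenario, so that a CUDA scenario is skipped rather than failed on a GPU-less host.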

Separately: this PR overlaps #2267, which closes the same issue (#2264) with the identical sender.py/receiver.py/cpu2cpu changes — the two will conflict, so only one can land. Once the CUDA-gating above is sorted, this is the superset of the two.


Generated by Claude Code

@heyong4725 heyong4725 left a comment

Collaborator

Review summary

Finding 2 (flaky assertion) is correct and clean. Finding 1's intent is right, but as wired the new negative-lifecycle tests can't run on the environment the gate targets — see below.

🔴 Critical — 4 of the 5 "covered" scenarios require CUDA, not just torch

All six new tests are gated on torch + tqdm (the #[ignore] reason text and smoke-all.sh's python3 -c "import torch, tqdm"). But four of the wired scenarios — auto_cleanup.yml, duplicate_free.yml, read_after_free.yml, write_after_free.yml — declare receiver_device: cuda, and receiver.py enforces:

if RECEIVER_DEVICE.startswith("cuda") and not torch.cuda.is_available():
    raise RuntimeError("CUDA is not available for the configured receiver device.")

The example README confirms it: "The CUDA receiver scenarios require a working CUDA runtime."

Reproduced on a torch-2.5.1 / cuda.is_available() == False machine (exactly the gate's target): the guard raises RuntimeError before the event loop. Both run_smoke_test_local and smoke-all.sh's run_local fail on a non-zero dora run exit, so all four negative-lifecycle tests fail on any torch-but-no-CUDA machine — the precise machines the import torch gate is meant to enable.

Why it matters: Finding 1 of #2264 is "zero CI coverage for the unsafe transport paths." Those unsafe paths are exactly the free/read/write/cleanup lifecycle calls — the four CUDA scenarios. As wired they can't run on CPU CI, so the unsafe-path coverage stays effectively zero; only cpu2cpu (the safe throughput path) actually runs.

Recommended fix (actually achieves the goal): switch the four negative YAMLs to receiver_device: cpu. The pool lifecycle / "warn-don't-crash" contract lives in the daemon's device-independent pool bookkeeping, so CPU exercises it fully and the torch + tqdm gate becomes correct.

Alternative: keep the YAMLs but gate those four tests on torch.cuda.is_available() (not import torch) and fix the docs below — but then they remain un-run on CPU CI, so Finding 1 is not addressed in practice; say so rather than marking the example "covered."

🟡 Minor — coverage table overstates coverage

The new example-smoke.rs table entry implies only cuda2cpu/cpu2cuda need a GPU, but four of the five it lists as covered also need CUDA. Whichever fix is chosen, correct this comment and the #[ignore = "requires torch and tqdm"] reasons on the four CUDA tests to state the real requirement.
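
One way to keep the coverage note honest is to derive the CUDA requirement from the dataflow files themselves. The sketch below is a hedged illustration: `receiver_devices` and `needs_cuda` are hypothetical helpers, the sample YAML layout is an assumption (only the `env:` / `receiver_device:` lines are taken from this thread), and a regex is used instead of a YAML parser to keep it dependency-free.

```python
import re

# Matches lines such as "receiver_device: cuda" or "receiver_device: 'cuda:0'".
DEVICE_RE = re.compile(r"^\s*receiver_device:\s*['\"]?([\w:]+)", re.MULTILINE)


def receiver_devices(yaml_text):
    """Return every receiver_device value found in a dataflow YAML's text."""
    return DEVICE_RE.findall(yaml_text)


def needs_cuda(yaml_text):
    """True if any receiver in the dataflow is configured for a CUDA device."""
    return any(d.startswith("cuda") for d in receiver_devices(yaml_text))


# Hypothetical dataflow layout, for illustration only.
sample = """
nodes:
  - id: receiver
    path: receiver.py
    env:
      receiver_device: cuda
"""
print(needs_cuda(sample))                         # True
print(needs_cuda(sample.replace("cuda", "cpu")))  # False
```

Running such a check over the scenario files would show which entries in the table genuinely run on CPU-only CI.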

✅ Finding 2 — correct

Deterministic counter (random_data[0] = i → assert torch_tensor[0] == i) is a strictly stronger, collision-free replacement for the sum-of-8 checksum. Traced i==0 (register), i>0 (in-place write propagation), and the write_after_free skip — all correct.

Verdict

Finding 2 ready to merge. Finding 1 needs the four negative scenarios runnable on CPU (or honestly re-labelled) before they provide the CI coverage the PR claims.

Comment thread scripts/smoke-all.sh Outdated

echo ""
echo "=== Memory-pool CPU transport (requires torch) ==="
if python3 -c "import torch, tqdm" 2>/dev/null; then

Collaborator

This import torch, tqdm gate is insufficient for the four negative scenarios below: auto_cleanup.yml, duplicate_free.yml, read_after_free.yml, and write_after_free.yml all set receiver_device: cuda, and receiver.py raises RuntimeError when torch.cuda.is_available() is false. On a torch-but-no-CUDA machine (this gate's target) those four run_local calls crash the receiver and fail the smoke run.

Recommended: switch the four negative YAMLs to receiver_device: cpu (the pool lifecycle contract is device-independent), which makes this gate correct. Otherwise gate those four on CUDA separately — but then they don't run here, leaving #2264 Finding 1 unaddressed on CPU CI.

Comment thread tests/example-smoke.rs
// | | duplicate_free, read_after_free, write_after_free} | |
// | | (#[ignore], run when torch+tqdm available); | |
// | | smoke-all.sh gates on `import torch`. | |
// | | cuda2cpu/cpu2cuda/etc blocked: needs NVIDIA CUDA. | |

Collaborator

This implies only cuda2cpu/cpu2cuda need a GPU, but auto_cleanup, duplicate_free, read_after_free, and write_after_free also set receiver_device: cuda and require CUDA. So four of the five scenarios listed as "covered" above cannot run on CPU CI as wired. Please correct this note (and the #[ignore = "requires torch and tqdm"] reasons on those four tests) to reflect the real requirement, or flip the YAMLs to receiver_device: cpu so the "covered" claim holds.

@heyong4725

Collaborator

@tang-canran thanks for the PR — left a review (changes requested). Finding 2 (the deterministic counter fix) looks great and is good to merge as-is.

One blocker on Finding 1: four of the five scenarios marked "covered" (auto_cleanup, duplicate_free, read_after_free, write_after_free) are wired to receiver_device: cuda, so receiver.py raises RuntimeError on any torch-but-no-CUDA machine — i.e. they crash on exactly the runners the import torch gate targets, and the unsafe-path coverage #2264 asked for never actually runs on CPU CI. Recommended fix: flip those four YAMLs to receiver_device: cpu (the pool lifecycle contract is device-independent). Details in the inline comments. Happy to help if useful.

The four negative-lifecycle scenarios (duplicate_free, read_after_free,
write_after_free, auto_cleanup) previously hardcoded receiver_device: cuda,
which caused RuntimeError on any torch-but-no-CUDA machine — exactly the
environment the import-torch gate targets.

The pool lifecycle contract (register/write/read/free/cleanup) lives in
the daemon's device-independent bookkeeping, and the receiver.py
assertion is already device-agnostic. CPU path exercises the same
unsafe shmem/seqlock/free paths that the GPU path does.

Switching to receiver_device: cpu makes the torch-only gate correct and
allows these four scenarios to run on CPU CI, actually achieving the
goal of Finding 1 from dora-rs#2264.

Refs: dora-rs#2264, dora-rs#2268

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@tang-canran

Contributor Author

@heyong4725 Fixed — switched the four negative-lifecycle YAMLs to receiver_device: cpu.
The pool lifecycle (register/write/read/free/cleanup) is device-independent, so CPU
path exercises the same unsafe shmem/seqlock/free paths.

All 6 smoke tests pass on the updated branch:

  • cargo test --test example-smoke memory_pool -- --ignored --test-threads=1 → 6/6 ✅
  • All 4 negative dora run scenarios verified on CPU ✅

Collaborator

Automated review by Claude: this is a fully automated review; no human has vetted it. Treat accordingly.

I re-reviewed after the latest commit (5441b9e). It flips auto_cleanup.yml, duplicate_free.yml, read_after_free.yml, and write_after_free.yml from receiver_device: cuda to cpu, which directly addresses the earlier feedback that those four scenarios would raise RuntimeError on a torch-but-no-CUDA host (the gate is import torch, not CUDA availability) and so couldn't actually run under the new smoke coverage. With the flip they're genuinely CPU-runnable.

No new issues found in the latest diff:

  • The 5 referenced YAMLs all exist under examples/memory-pool/.
  • The counter fix is deterministic: sender stamps random_data[0] = i every iteration; receiver asserts tensor[0] == i, correctly skipped for write_after_free. Frees happen only on the final iteration after the per-iteration check, so the invariant holds.
  • Rust helper signatures and #[ignore] gating match conventions; smoke-all.sh wiring is consistent.

The two outstanding inline review threads (on scripts/smoke-all.sh and tests/example-smoke.rs re: the cuda devices) now describe pre-fix state and look stale/addressed — worth resolving.

One process note unchanged from before: this overlaps with #2267 and #2279, all closing #2264 — only one should land.


Generated by Claude Code

Collaborator

Automated review by Claude: this is a fully automated review with no human in the loop; please verify before acting on it.

The receiver.py counter fix looks correct: the turn-based handshake guarantees the receiver reads value i at iteration i, and the write_after_free skip is preserved, so the deterministic tensor[0] == i check is a clean improvement over the ~3% sum-collision flakiness. No issue there.

I do see one problem with the smoke-test wiring, which is the part this PR exists to add:

The torch gate checks the wrong interpreter, so the new coverage won't actually exercise the path. In scripts/smoke-all.sh the new block gates on the host interpreter:

if python3 -c "import torch, tqdm" 2>/dev/null; then
    run_networked "memory-pool-cpu2cpu" ...

But cpu2cpu.yml's nodes are dep-less (path: sender.py / receiver.py, no build: step). Per this script's own header (lines 23-25), dep-less Python nodes "run in the ambient env, not a per-node managed env" — i.e. the target/smoke-venv that ensure_python_bindings activates. That venv is built with only the workspace dora bindings (maturin) plus uv venv --seed; it never installs torch/tqdm. So:

  • Once smoke-venv is activated (which happens as soon as any earlier --uv Python example runs), python3 is the venv interpreter — which lacks torch — so the gate evaluates false and the memory-pool tests are always skipped, even on a machine where the host Python has torch.
  • If torch were importable at the gate, the nodes would then run in an env that still lacks it and fail at import torch.

The cargo test --test example-smoke -- --ignored smoke_memory_pool path has the same gap: with --uv and no build: step there's nothing that provisions torch into the env the nodes execute in. Contrast PR #2279, which adds build: install steps to cpu2cpu.yml for exactly this reason (though its --index-url form has its own problem). Net: as written, the smoke coverage this PR adds is effectively unreachable unless the operator manually installs torch into smoke-venv, which nothing documents or does.

Suggestion: either add a build: step that installs torch/tqdm into the node env (so --uv provisions them), or have the gate/skip logic check the same interpreter the nodes will actually run under, rather than bare python3.
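
The second option, probing the interpreter the nodes will actually use, might look like the sketch below. It is an assumption-laden illustration: `can_import` is a hypothetical helper, and how the script would discover the node interpreter's path is left open.

```python
import subprocess
import sys


def can_import(interpreter, modules):
    """Check whether `interpreter` can import all `modules` in a subprocess."""
    code = "import " + ", ".join(modules)
    result = subprocess.run(
        [interpreter, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


# Probe the interpreter running this script; a real gate would instead pass
# the path of the interpreter the nodes execute under (an assumption here).
print(can_import(sys.executable, ["json", "os"]))          # True
print(can_import(sys.executable, ["no_such_module_xyz"]))  # False
```

Because the probe runs in a subprocess of the named interpreter, the result reflects that environment's packages rather than whatever `python3` resolves to on `PATH`.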

Minor, non-blocking: the four negative-lifecycle YAMLs were flipped to receiver_device: cpu (good — that resolves the earlier CUDA-on-non-CUDA failure), but examples/memory-pool/README.md still describes only the CUDA receiver scenarios and doesn't mention cpu2cpu.yml.

(Note: #2267 and #2279 also target #2264 with overlapping changes to these same files — only one can land without conflicts.)


Generated by Claude Code

Add build: steps to cpu2cpu.yml
and the four negative-lifecycle YAMLs, so --uv provisions torch into
per-node managed environments. This fixes the gate gap reported by
phil-opp: smoke-all.sh activates a smoke-venv that lacks torch, so the
import torch, tqdm gate always evaluated false.

Also update README.md to document cpu2cpu.yml and note that CPU
scenarios no longer require CUDA.

Refs: dora-rs#2264, dora-rs#2268

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Collaborator

🤖 Automated review by Claude (Claude Code). This is a fully automated review with no human in the loop — please treat it as a suggestion.

I looked at the new commit 8c3667c (the per-node build: steps + README update added after the last review). It partially addresses the earlier point, but two issues remain:

  1. The smoke-all.sh torch gate is still unreachable, so the coverage this PR exists to add does not actually run. The memory-pool block is gated on python3 -c "import torch, tqdm". By the time that block runs, an earlier --uv Python example has already activated target/smoke-venv (via ensure_python_bindings), and that venv only installs pyarrow numpy + -e apis/python/node — never torch/tqdm. So python3 is the venv interpreter, the gate evaluates false, and all six entries are logged as SKIP. The new build: steps fix what happens if a node runs, but not the gate that decides whether it runs. To actually exercise the unsafe transport path, either drop the gate and rely on the per-node build: steps, or point the check at the interpreter/env the nodes actually run in.

  2. The build: pip commands resolve numpy/tqdm from the PyTorch CPU index, which doesn't reliably host them. e.g. in cpu2cpu.yml: pip install torch --index-url https://download.pytorch.org/whl/cpu numpy tqdm. --index-url replaces PyPI entirely, so numpy/tqdm are then resolved only from download.pytorch.org/whl/cpu. That index mirrors torch's own pinned deps but is not a general index and does not reliably serve tqdm, so the build can fail with "no matching distribution for tqdm". The safe form is --extra-index-url (additive), or installing torch and the rest in separate pip install invocations.
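
The two safe forms mentioned in point 2 can be sketched as command builders. This is an illustration only: `pip_install_additive` and `pip_install_split` are hypothetical names, and the commands are built as argument lists rather than executed. `--index-url` and `--extra-index-url` are standard pip options.

```python
import sys

TORCH_CPU_INDEX = "https://download.pytorch.org/whl/cpu"


def pip_install_additive(packages):
    """One invocation: PyPI stays available, the torch CPU index is added."""
    return [sys.executable, "-m", "pip", "install",
            "--extra-index-url", TORCH_CPU_INDEX, *packages]


def pip_install_split(torch_pkgs, other_pkgs):
    """Two invocations: only the torch install uses the replacement index."""
    return [
        [sys.executable, "-m", "pip", "install",
         "--index-url", TORCH_CPU_INDEX, *torch_pkgs],
        [sys.executable, "-m", "pip", "install", *other_pkgs],
    ]


# Print the arguments after "python -m pip" for the additive form.
print(pip_install_additive(["torch", "numpy", "tqdm"])[3:])
```

Either list could be handed to `subprocess.run`; in both forms numpy and tqdm are never restricted to the PyTorch index alone.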

The counter-based assertion change (sender stamps random_data[0] = i, receiver asserts tensor[0] == i) and the CUDA→CPU flip of the negative-lifecycle YAMLs both look correct to me.


Generated by Claude Code

1. Remove the smoke-all.sh torch gate: the host-side
   python3 -c "import torch, tqdm" check always evaluates false because
   smoke-venv (activated by earlier --uv Python examples) lacks torch.
   Rely on per-node build: steps instead; dora --uv provisions torch
   into each node's managed env.

2. Replace --index-url with --extra-index-url in all five YAML
   build: steps. --index-url replaces PyPI entirely, so numpy/tqdm
   are then resolved only from download.pytorch.org/whl/cpu,
   which doesn't reliably host them. --extra-index-url adds the
   CPU torch index while keeping PyPI as the primary source.

Refs: dora-rs#2264, dora-rs#2268

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Collaborator

Automated review by Claude: this is a fully automated review with no human in the loop. Treat it as a suggestion, not an approval.

Re-reviewed after commit 7414260, which landed just after the previous automated review. It addresses both points raised there:

  1. Unreachable gate removed: scripts/smoke-all.sh no longer gates the memory-pool block on python3 -c "import torch, tqdm" (which always evaluated false inside the activated smoke-venv). The entries now run unconditionally and rely on the per-node build: steps; needs_uv() matches these YAMLs (they contain .py/pip install), so --uv provisions a per-node managed env and runs the build: step. As a result, torch/numpy/tqdm are now genuinely installed for both the networked and local runs.
  2. --index-url → --extra-index-url: all five YAMLs now use the additive form, so numpy/tqdm resolve from PyPI while torch comes from download.pytorch.org/whl/cpu, avoiding the "no matching distribution" failure.

The counter-based assertion fix and the CUDA→CPU flip of the negative-lifecycle YAMLs remain correct. No new issues found in the latest diff.

(Process note, unchanged from earlier: this still overlaps #2267 / #2279, which also close #2264 — only one of the three can land without conflicts.)


Generated by Claude Code

@tang-canran

Contributor Author

@heyong4725 All feedback from both you and phil-opp has been addressed across the last few commits. phil-opp's latest automated review (commit 7414260) confirmed no new issues.
Ready for re-review when you have a moment.

Development

Successfully merging this pull request may close these issues.

memory-pool transport (#2168): 2,977-line unsafe/FFI feature merged with no integration test coverage + a flaky example assertion

3 participants