fix(memory-pool): add CI smoke-test coverage and fix flaky assertion by tang-canran · Pull Request #2268 · dora-rs/dora

tang-canran · 2026-06-18T13:30:43Z

Addresses automated review findings from issue #2264 on merged PR #2168.

Finding 1 — zero CI coverage for unsafe transport paths:

Add 6 smoke tests to tests/example-smoke.rs (all #[ignore] gated on torch + tqdm): cpu2cpu (networked + local), auto_cleanup, duplicate_free, read_after_free, write_after_free
Wire memory-pool into scripts/smoke-all.sh with proper torch gating
Mark memory-pool as 'covered' in the example coverage table

Finding 2 — probabilistically-flaky assertion in receiver.py:

Replace checksum-of-first-8-int64 change detection with a deterministic counter: sender stamps random_data[0] = i, receiver asserts torch_tensor[0] == i
Eliminates ~1/3000 per-comparison collision risk
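To illustrate why a counter stamp is more robust than a checksum for change detection, here is a minimal sketch. It uses stdlib `array` buffers in place of the torch tensors, and `first8_checksum`, `stamp`, and `random_payload` are hypothetical names, not the ones in sender.py or receiver.py.

```python
from array import array
import random


def first8_checksum(buf):
    """Sum of the first eight int64 values (stand-in for the old signal)."""
    return sum(buf[:8])


# Two different payloads whose first eight values share the same sum.
prev = array("q", [5, 1, 0, 0, 0, 0, 0, 0])
curr = array("q", [1, 5, 0, 0, 0, 0, 0, 0])
assert prev != curr

# The checksum comparison reports "no change" even though the data changed.
checksum_sees_change = first8_checksum(prev) != first8_checksum(curr)


def stamp(buf, i):
    """Sender side: overwrite slot 0 with the iteration counter."""
    buf[0] = i
    return buf


def random_payload(rng, n=16):
    """Random int64 payload, standing in for the sender's random_data."""
    return array("q", (rng.randrange(0, 256) for _ in range(n)))


rng = random.Random(0)
# Receiver side: slot 0 must equal the iteration index, whatever the payload.
counter_ok = all(stamp(random_payload(rng), i)[0] == i for i in range(100))

print(checksum_sees_change, counter_ok)
```

The counter check does not depend on the random payload at all, which is what makes it deterministic.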

Addresses automated review findings from issue dora-rs#2264 on merged PR dora-rs#2168.

Finding 1 — zero CI coverage for unsafe transport paths:
- Add 6 smoke tests to tests/example-smoke.rs (all #[ignore] gated on torch + tqdm): cpu2cpu (networked + local), auto_cleanup, duplicate_free, read_after_free, write_after_free
- Wire memory-pool into scripts/smoke-all.sh with proper torch gating
- Mark memory-pool as 'covered' in the example coverage table

Finding 2 — probabilistically-flaky assertion in receiver.py:
- Replace checksum-of-first-8-int64 change detection with a deterministic counter: sender stamps random_data[0] = i, receiver asserts torch_tensor[0] == i
- Eliminates ~1/3000 per-comparison collision risk

Closes dora-rs#2264

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

trunk-io · 2026-06-18T13:30:48Z

Merging to main in this repository is managed by Trunk.

To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here.

tang-canran · 2026-06-18T13:39:55Z

@heyong4725 PR ready for review. This addresses both findings from #2264:

Finding 1 — added 6 smoke tests (cpu2cpu networked + local, plus 4 negative-lifecycle scenarios for the "warn, don't crash" contract), all #[ignore] gated on torch+tqdm availability.

Finding 2 — replaced the checksum-of-first-8 assertion with a deterministic counter (sender stamps random_data[0] = i, receiver asserts tensor[0] == i), eliminating the ~1/3000 collision risk.

tang-canran · 2026-06-18T13:51:18Z

cargo test --test example-smoke memory_pool -- --ignored --test-threads=1:

running 6 tests
test smoke_local_memory_pool_auto_cleanup ... ok
test smoke_local_memory_pool_cpu2cpu ... ok
test smoke_local_memory_pool_duplicate_free ... ok
test smoke_local_memory_pool_read_after_free ... ok
test smoke_local_memory_pool_write_after_free ... ok
test smoke_memory_pool_cpu2cpu ... ok

test result: ok. 6 passed; 0 failed; 0 ignored; 0 measured; 64 filtered out

And the 7 example dataflows:

scenario	result
cpu2cpu.yml	~6979 MB/s ✅
cpu2cuda.yml	~5648 MB/s ✅
cuda2cpu.yml	~8131 MB/s ✅
duplicate_free.yml	expected warning, no crash ✅
read_after_free.yml	expected warning, no crash ✅
write_after_free.yml	expected warning, no crash ✅
auto_cleanup.yml	cleanup logs, no crash ✅

phil-opp · 2026-06-18T14:36:10Z

Automated review by Claude — this is a fully automated review (Claude Code); no human has vetted it. Treat as a suggestion, not an approval.

Reviewed the diff directly, not the description.

The Finding-2 change is sound: the sender.py counter stamp (random_data[0] = i) plus the receiver.py tensor[0] == i assertion is deterministic and removes the sum-collision risk. The assertion is correctly skipped for write_after_free, and the negative-lifecycle scenarios free only on the final iteration after the per-iteration check, so the == i invariant holds throughout. The cpu2cpu wiring is gated fine. No issues there.

I found one important issue with the four new negative-lifecycle tests (auto_cleanup, duplicate_free, read_after_free, write_after_free).

All four of those dataflow files hardcode receiver_device: cuda:

# examples/memory-pool/{auto_cleanup,duplicate_free,read_after_free,write_after_free}.yml
env:
  receiver_device: cuda

and receiver.py bails immediately when CUDA is absent:

if RECEIVER_DEVICE.startswith("cuda") and not torch.cuda.is_available():
    raise RuntimeError("CUDA is not available for the configured receiver device.")

So these four scenarios require an actual NVIDIA GPU — but this PR gates them only on python3 -c "import torch, tqdm" (in scripts/smoke-all.sh) and labels the Rust tests #[ignore = "requires torch and tqdm"]. On a GPU-less machine that does have torch installed (the exact case the gate is meant to admit), smoke-all.sh will run all four and the receiver node will crash with RuntimeError: CUDA is not available, failing the suite. The "all 6 passed" run in the PR description only passes because it was run on a CUDA box.

This also makes the coverage-table entry misleading: these four belong with the same "needs NVIDIA CUDA" blocker as cuda2cpu/cpu2cuda, not under a torch-only gate.

Only cpu2cpu.yml is genuinely GPU-free (receiver_device: cpu). I'd either (a) drop the four CUDA scenarios from the torch-only wiring and keep just cpu2cpu (which is exactly what #2267 does), or (b) gate them on torch.cuda.is_available() rather than import torch, tqdm.
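Option (b) amounts to separating "torch is importable" from "CUDA is usable". A minimal sketch of that distinction follows; `torch_capability` and `should_run` are hypothetical helper names for illustration, not part of the PR. Only `torch.cuda.is_available()` and `importlib.util.find_spec` are real APIs here.

```python
import importlib.util


def torch_capability():
    """Classify the current interpreter as 'no-torch', 'cpu-only', or 'cuda'."""
    if importlib.util.find_spec("torch") is None:
        return "no-torch"
    import torch  # imported lazily so the probe works without torch installed

    return "cuda" if torch.cuda.is_available() else "cpu-only"


def should_run(receiver_device, capability):
    """Decide whether a scenario with this receiver_device can run here."""
    if capability == "no-torch":
        return False
    if receiver_device.startswith("cuda"):
        # Mirrors the receiver.py guard: CUDA devices need a CUDA runtime.
        return capability == "cuda"
    return True


# A torch-but-no-CUDA machine can run cpu scenarios but not cuda ones.
print(should_run("cpu", "cpu-only"), should_run("cuda", "cpu-only"))
```

With a gate shaped like this, an `import torch` check alone would admit only scenarios whose receiver device is `cpu`.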

Separately: this PR overlaps #2267, which closes the same issue (#2264) with the identical sender.py/receiver.py/cpu2cpu changes — the two will conflict, so only one can land. Once the CUDA-gating above is sorted, this is the superset of the two.

Generated by Claude Code

heyong4725

Review summary

Finding 2 (flaky assertion) is correct and clean. Finding 1's intent is right, but as wired the new negative-lifecycle tests can't run on the environment the gate targets — see below.

🔴 Critical — 4 of the 5 "covered" scenarios require CUDA, not just torch

All six new tests are gated on torch + tqdm (the #[ignore] reason text and smoke-all.sh's python3 -c "import torch, tqdm"). But four of the wired scenarios — auto_cleanup.yml, duplicate_free.yml, read_after_free.yml, write_after_free.yml — declare receiver_device: cuda, and receiver.py enforces:

if RECEIVER_DEVICE.startswith("cuda") and not torch.cuda.is_available():
    raise RuntimeError("CUDA is not available for the configured receiver device.")

The example README confirms it: "The CUDA receiver scenarios require a working CUDA runtime."

Reproduced on a torch-2.5.1 / cuda.is_available() == False machine (exactly the gate's target): the guard raises RuntimeError before the event loop. Both run_smoke_test_local and smoke-all.sh's run_local fail on a non-zero dora run exit, so all four negative-lifecycle tests fail on any torch-but-no-CUDA machine — the precise machines the import torch gate is meant to enable.

Why it matters: Finding 1 of #2264 is "zero CI coverage for the unsafe transport paths." Those unsafe paths are exactly the free/read/write/cleanup lifecycle calls — the four CUDA scenarios. As wired they can't run on CPU CI, so the unsafe-path coverage stays effectively zero; only cpu2cpu (the safe throughput path) actually runs.

Recommended fix (actually achieves the goal): switch the four negative YAMLs to receiver_device: cpu. The pool lifecycle / "warn-don't-crash" contract lives in the daemon's device-independent pool bookkeeping, so CPU exercises it fully and the torch + tqdm gate becomes correct.
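One way to keep a torch-only gate honest is to check which device each scenario file declares. The sketch below scans YAML text with a regular expression rather than a YAML parser, to stay stdlib-only. `receiver_devices` and `needs_cuda` are hypothetical names, and the two sample strings are reduced stand-ins for the real dataflow files.

```python
import re

# Matches lines such as "  receiver_device: cuda" or "receiver_device: 'cuda:0'".
_DEVICE_RE = re.compile(
    r"^[ \t]*receiver_device:[ \t]*['\"]?([\w:]+)['\"]?[ \t]*$",
    re.MULTILINE,
)


def receiver_devices(yaml_text):
    """Return every receiver_device value declared in the given YAML text."""
    return _DEVICE_RE.findall(yaml_text)


def needs_cuda(yaml_text):
    """True if any declared receiver device is a CUDA device."""
    return any(dev.startswith("cuda") for dev in receiver_devices(yaml_text))


# Reduced stand-ins for a scenario file before and after the suggested flip.
before = "env:\n  receiver_device: cuda\n"
after = "env:\n  receiver_device: cpu\n"

print(needs_cuda(before), needs_cuda(after))
```

A check like this could flag any scenario wired under a torch-only gate while still declaring a CUDA receiver.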

Alternative: keep the YAMLs but gate those four tests on torch.cuda.is_available() (not import torch) and fix the docs below — but then they remain un-run on CPU CI, so Finding 1 is not addressed in practice; say so rather than marking the example "covered."

🟡 Minor — coverage table overstates coverage

The new example-smoke.rs table entry implies only cuda2cpu/cpu2cuda need a GPU, but four of the five it lists as covered also need CUDA. Whichever fix is chosen, correct this comment and the #[ignore = "requires torch and tqdm"] reasons on the four CUDA tests to state the real requirement.

✅ Finding 2 — correct

Deterministic counter (random_data[0] = i → assert torch_tensor[0] == i) is a strictly stronger, collision-free replacement for the sum-of-8 checksum. Traced i==0 (register), i>0 (in-place write propagation), and the write_after_free skip — all correct.
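The traced behavior can be pictured with a small simulation. This is only a sketch under stated assumptions: a stdlib `array` stands in for the pooled buffer, the sender and receiver take turns inside one loop, and the `scenario` parameter is a hypothetical way of telling the receiver which dataflow it is running.

```python
from array import array


def run_iterations(n, scenario="cpu2cpu"):
    """Simulate n sender/receiver turns; return how many checks the receiver made."""
    shared = array("q", [0] * 8)  # stand-in for the registered pool buffer
    checked = 0
    for i in range(n):
        # Sender turn: in-place write of the iteration counter into slot 0.
        shared[0] = i
        # Receiver turn: verify the stamp, except where the scenario
        # deliberately invalidates the buffer contents.
        if scenario != "write_after_free":
            assert shared[0] == i, f"expected {i}, got {shared[0]}"
            checked += 1
    return checked


print(run_iterations(10), run_iterations(10, "write_after_free"))
```

Because the handshake is turn-based, the receiver always reads the value written for the current iteration, so the equality holds for every `i`.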

Verdict

Finding 2 ready to merge. Finding 1 needs the four negative scenarios runnable on CPU (or honestly re-labelled) before they provide the CI coverage the PR claims.

heyong4725 · 2026-06-18T14:54:34Z

+
+    echo ""
+    echo "=== Memory-pool CPU transport (requires torch) ==="
+    if python3 -c "import torch, tqdm" 2>/dev/null; then


This import torch, tqdm gate is insufficient for the four negative scenarios below: auto_cleanup.yml, duplicate_free.yml, read_after_free.yml, and write_after_free.yml all set receiver_device: cuda, and receiver.py raises RuntimeError when torch.cuda.is_available() is false. On a torch-but-no-CUDA machine (this gate's target) those four run_local calls crash the receiver and fail the smoke run.

Recommended: switch the four negative YAMLs to receiver_device: cpu (the pool lifecycle contract is device-independent), which makes this gate correct. Otherwise gate those four on CUDA separately — but then they don't run here, leaving #2264 Finding 1 unaddressed on CPU CI.

heyong4725 · 2026-06-18T14:54:36Z

+// |                           | duplicate_free, read_after_free, write_after_free}   |          |
+// |                           | (#[ignore], run when torch+tqdm available);          |          |
+// |                           | smoke-all.sh gates on `import torch`.                 |          |
+// |                           | cuda2cpu/cpu2cuda/etc blocked: needs NVIDIA CUDA.     |          |


This implies only cuda2cpu/cpu2cuda need a GPU, but auto_cleanup, duplicate_free, read_after_free, and write_after_free also set receiver_device: cuda and require CUDA. So four of the five scenarios listed as "covered" above cannot run on CPU CI as wired. Please correct this note (and the #[ignore = "requires torch and tqdm"] reasons on those four tests) to reflect the real requirement, or flip the YAMLs to receiver_device: cpu so the "covered" claim holds.

heyong4725 · 2026-06-18T14:59:25Z

@tang-canran thanks for the PR — left a review (changes requested). Finding 2 (the deterministic counter fix) looks great and is good to merge as-is.

One blocker on Finding 1: four of the five scenarios marked "covered" (auto_cleanup, duplicate_free, read_after_free, write_after_free) are wired to receiver_device: cuda, so receiver.py raises RuntimeError on any torch-but-no-CUDA machine — i.e. they crash on exactly the runners the import torch gate targets, and the unsafe-path coverage #2264 asked for never actually runs on CPU CI. Recommended fix: flip those four YAMLs to receiver_device: cpu (the pool lifecycle contract is device-independent). Details in the inline comments. Happy to help if useful.

The four negative-lifecycle scenarios (duplicate_free, read_after_free, write_after_free, auto_cleanup) previously hardcoded receiver_device: cuda, which caused a RuntimeError on any torch-but-no-CUDA machine, exactly the environment the import-torch gate targets.

The pool lifecycle contract (register/write/read/free/cleanup) lives in the daemon's device-independent bookkeeping, and the receiver.py assertion is already device-agnostic. The CPU path exercises the same unsafe shmem/seqlock/free paths that the GPU path does.

Switching to receiver_device: cpu makes the torch-only gate correct and allows these four scenarios to run on CPU CI, actually achieving the goal of Finding 1 from dora-rs#2264.

Refs: dora-rs#2264, dora-rs#2268

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

tang-canran · 2026-06-19T08:27:32Z

@heyong4725 Fixed: switched the four negative-lifecycle YAMLs to receiver_device: cpu. The pool lifecycle (register/write/read/free/cleanup) is device-independent, so the CPU path exercises the same unsafe shmem/seqlock/free paths.

All 6 smoke tests pass on the updated branch:

cargo test --test example-smoke memory_pool -- --ignored --test-threads=1 → 6/6 ✅
All 4 negative dora run scenarios verified on CPU ✅

phil-opp · 2026-06-19T08:35:58Z

Automated review by Claude — this is a fully automated review; no human has vetted it. Treat accordingly.

I re-reviewed after the latest commit (5441b9e). It flips auto_cleanup.yml, duplicate_free.yml, read_after_free.yml, and write_after_free.yml from receiver_device: cuda to cpu, which directly addresses the earlier feedback that those four scenarios would raise RuntimeError on a torch-but-no-CUDA host (the gate is import torch, not CUDA availability) and so couldn't actually run under the new smoke coverage. With the flip they're genuinely CPU-runnable.

No new issues found in the latest diff:

The 5 referenced YAMLs all exist under examples/memory-pool/.
The counter fix is deterministic: sender stamps random_data[0] = i every iteration; receiver asserts tensor[0] == i, correctly skipped for write_after_free. Frees happen only on the final iteration after the per-iteration check, so the invariant holds.
Rust helper signatures and #[ignore] gating match conventions; smoke-all.sh wiring is consistent.

The two outstanding inline review threads (on scripts/smoke-all.sh and tests/example-smoke.rs re: the cuda devices) now describe pre-fix state and look stale/addressed — worth resolving.

One process note unchanged from before: this overlaps with #2267 and #2279, all closing #2264 — only one should land.

Generated by Claude Code

phil-opp · 2026-06-20T14:37:49Z

Automated review by Claude — this is a fully automated review with no human in the loop; please verify before acting on it.

The receiver.py counter fix looks correct: the turn-based handshake guarantees the receiver reads value i at iteration i, and the write_after_free skip is preserved, so the deterministic tensor[0] == i check is a clean improvement over the ~3% sum-collision flakiness. No issue there.

I do see one problem with the smoke-test wiring, which is the part this PR exists to add:

The torch gate checks the wrong interpreter, so the new coverage won't actually exercise the path. In scripts/smoke-all.sh the new block gates on the host interpreter:

if python3 -c "import torch, tqdm" 2>/dev/null; then
    run_networked "memory-pool-cpu2cpu" ...

But cpu2cpu.yml's nodes are dep-less (path: sender.py / receiver.py, no build: step). Per this script's own header (lines 23-25), dep-less Python nodes "run in the ambient env, not a per-node managed env" — i.e. the target/smoke-venv that ensure_python_bindings activates. That venv is built with only the workspace dora bindings (maturin) plus uv venv --seed; it never installs torch/tqdm. So:

Once smoke-venv is activated (which happens as soon as any earlier --uv Python example runs), python3 is the venv interpreter — which lacks torch — so the gate evaluates false and the memory-pool tests are always skipped, even on a machine where the host Python has torch.
If torch were importable at the gate, the nodes would then run in an env that still lacks it and fail at import torch.

The cargo test --test example-smoke -- --ignored smoke_memory_pool path has the same gap: with --uv and no build: step there's nothing that provisions torch into the env the nodes execute in. Contrast PR #2279, which adds build: install steps to cpu2cpu.yml for exactly this reason (though its --index-url form has its own problem). Net: as written, the smoke coverage this PR adds is effectively unreachable unless the operator manually installs torch into smoke-venv, which nothing documents or does.

Suggestion: either add a build: step that installs torch/tqdm into the node env (so --uv provisions them), or have the gate/skip logic check the same interpreter the nodes will actually run under, rather than bare python3.
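The second option, checking the interpreter the nodes will actually run under, can be sketched as follows. `can_import` and `in_virtualenv` are hypothetical helpers. The interpreter path would have to come from wherever the node environment lives, which this sketch does not try to discover.

```python
import subprocess
import sys


def can_import(interpreter, modules):
    """Return True if `interpreter` can import all of `modules`."""
    code = "import " + ", ".join(modules)
    result = subprocess.run(
        [interpreter, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def in_virtualenv():
    """True when the current interpreter is running inside a venv."""
    return sys.prefix != sys.base_prefix


# Probe the interpreter running this script; a gate would pass the
# node environment's interpreter path instead of sys.executable.
print(can_import(sys.executable, ["json", "os"]), in_virtualenv())
```

The point of passing the interpreter explicitly is that a bare `python3` resolves to whatever environment is active when the check runs, which may differ from the one the nodes use.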

Minor, non-blocking: the four negative-lifecycle YAMLs were flipped to receiver_device: cpu (good — that resolves the earlier CUDA-on-non-CUDA failure), but examples/memory-pool/README.md still describes only the CUDA receiver scenarios and doesn't mention cpu2cpu.yml.

(Note: #2267 and #2279 also target #2264 with overlapping changes to these same files — only one can land without conflicts.)

Generated by Claude Code

Add build: steps to cpu2cpu.yml and the four negative-lifecycle YAMLs, so dora --uv provisions torch into per-node managed environments. This fixes the gate gap reported by phil-opp: smoke-all.sh activates a smoke-venv that lacks torch, so the gate always evaluated false. Also update README.md to document cpu2cpu.yml and note that CPU scenarios no longer require CUDA.

Refs: dora-rs#2264, dora-rs#2268

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

phil-opp · 2026-06-21T08:38:26Z

🤖 Automated review by Claude (Claude Code). This is a fully automated review with no human in the loop — please treat it as a suggestion.

I looked at the new commit 8c3667c (the per-node build: steps + README update added after the last review). It partially addresses the earlier point, but two issues remain:

The smoke-all.sh torch gate is still unreachable, so the coverage this PR exists to add does not actually run. The memory-pool block is gated on python3 -c "import torch, tqdm". By the time that block runs, an earlier --uv Python example has already activated target/smoke-venv (via ensure_python_bindings), and that venv only installs pyarrow numpy + -e apis/python/node — never torch/tqdm. So python3 is the venv interpreter, the gate evaluates false, and all six entries are logged as SKIP. The new build: steps fix what happens if a node runs, but not the gate that decides whether it runs. To actually exercise the unsafe transport path, either drop the gate and rely on the per-node build: steps, or point the check at the interpreter/env the nodes actually run in.
The build: pip commands resolve numpy/tqdm from the PyTorch CPU index, which doesn't reliably host them. e.g. in cpu2cpu.yml: pip install torch --index-url https://download.pytorch.org/whl/cpu numpy tqdm. --index-url replaces PyPI entirely, so numpy/tqdm are then resolved only from download.pytorch.org/whl/cpu. That index mirrors torch's own pinned deps but is not a general index and does not reliably serve tqdm, so the build can fail with "no matching distribution for tqdm". The safe form is --extra-index-url (additive), or installing torch and the rest in separate pip install invocations.
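The difference between the two pip flags can be made concrete with a small sketch that only assembles argument lists and never invokes pip. `pip_install_args` and `pypi_still_consulted` are hypothetical helpers; the index URL is the one quoted above.

```python
TORCH_CPU_INDEX = "https://download.pytorch.org/whl/cpu"


def pip_install_args(packages, index_url=None, extra_index_url=None):
    """Build a pip install argument list with optional index flags."""
    args = ["pip", "install", *packages]
    if index_url:
        # Replaces the default index: every package resolves from index_url only.
        args += ["--index-url", index_url]
    if extra_index_url:
        # Additive: the default index (PyPI) is still consulted.
        args += ["--extra-index-url", extra_index_url]
    return args


def pypi_still_consulted(args):
    """True if the default index has not been replaced by --index-url."""
    return "--index-url" not in args


risky = pip_install_args(["torch", "numpy", "tqdm"], index_url=TORCH_CPU_INDEX)
safe = pip_install_args(["torch", "numpy", "tqdm"], extra_index_url=TORCH_CPU_INDEX)

print(pypi_still_consulted(risky), pypi_still_consulted(safe))
```

Splitting the install into two invocations, one for torch with the CPU index and one for the remaining packages, would be an equivalent way to keep numpy and tqdm resolving from PyPI.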

The counter-based assertion change (sender stamps random_data[0] = i, receiver asserts tensor[0] == i) and the CUDA→CPU flip of the negative-lifecycle YAMLs both look correct to me.

Generated by Claude Code

1. Remove the smoke-all.sh torch gate: the host-side check always evaluates false because smoke-venv (activated by earlier --uv Python examples) lacks torch. Rely on the per-node build: steps instead; dora --uv provisions torch into each node's managed env.

2. Replace --index-url with --extra-index-url in all five YAML build: steps. --index-url replaces PyPI entirely, so numpy/tqdm are then resolved only from download.pytorch.org/whl/cpu, which doesn't reliably host them. --extra-index-url adds the CPU torch index while keeping PyPI as the primary source.

Refs: dora-rs#2264, dora-rs#2268

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

phil-opp · 2026-06-21T10:38:32Z

Automated review by Claude — this is a fully automated review with no human in the loop. Treat it as a suggestion, not an approval.

Re-reviewed after commit 7414260, which landed just after the previous automated review. It addresses both points raised there:

Unreachable gate removed — scripts/smoke-all.sh no longer gates the memory-pool block on python3 -c "import torch, tqdm" (which always evaluated false inside the activated smoke-venv). The entries now run unconditionally and rely on the per-node build: steps; needs_uv() matches these YAMLs (they contain .py/pip install), so --uv provisions a per-node managed env and runs the build: step — torch/numpy/tqdm are now genuinely installed for both the networked and local runs.
--index-url → --extra-index-url — all five YAMLs now use the additive form, so numpy/tqdm resolve from PyPI while torch comes from download.pytorch.org/whl/cpu, avoiding the "no matching distribution" failure.

The counter-based assertion fix and the CUDA→CPU flip of the negative-lifecycle YAMLs remain correct. No new issues found in the latest diff.

(Process note, unchanged from earlier: this still overlaps #2267 / #2279, which also close #2264 — only one of the three can land without conflicts.)

Generated by Claude Code

tang-canran · 2026-06-21T11:27:09Z

@heyong4725 All feedback from both you and phil-opp has been addressed across the last few commits. phil-opp's latest automated review (commit 7414260) confirmed no new issues.
Ready for re-review when you have a moment.

heyong4725 requested changes Jun 18, 2026

heyong4725 reviewed Jun 18, 2026

This was referenced Jun 19, 2026

fix(memory-pool): deterministic transport validation and smoke test coverage #2279

Open

fix(memory-pool): replace flaky sum-based assertion with monotonic counter check #2267

Open
