fix(memory-pool): add CI smoke-test coverage and fix flaky assertion#2268
fix(memory-pool): add CI smoke-test coverage and fix flaky assertion#2268tang-canran wants to merge 4 commits into
Conversation
Addresses automated review findings from issue dora-rs#2264 on merged PR dora-rs#2168. Finding 1 — zero CI coverage for unsafe transport paths: - Add 6 smoke tests to tests/example-smoke.rs (all #[ignore] gated on torch + tqdm): cpu2cpu (networked + local), auto_cleanup, duplicate_free, read_after_free, write_after_free - Wire memory-pool into scripts/smoke-all.sh with proper torch gating - Mark memory-pool as 'covered' in the example coverage table Finding 2 — probabilistically-flaky assertion in receiver.py: - Replace checksum-of-first-8-int64 change detection with a deterministic counter: sender stamps random_data[0] = i, receiver asserts torch_tensor[0] == i - Eliminates ~1/3000 per-comparison collision risk Closes dora-rs#2264 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Merging to
After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here |
|
@heyong4725 PR ready for review. This addresses both findings from #2264: Finding 1 — added 6 smoke tests (cpu2cpu networked + local, plus 4 negative-lifecycle scenarios for the "warn, don't crash" contract), all Finding 2 — replaced the checksum-of-first-8 assertion with a deterministic counter (sender stamps |
|
And the 7 example dataflows:
|
|
Automated review by Claude — this is a fully automated review (Claude Code); no human has vetted it. Treat as a suggestion, not an approval. Reviewed the diff directly, not the description. The Finding-2 change is sound: the I found one important issue with the four new negative-lifecycle tests ( All four of those dataflow files hardcode # examples/memory-pool/{auto_cleanup,duplicate_free,read_after_free,write_after_free}.yml
env:
receiver_device: cudaand if RECEIVER_DEVICE.startswith("cuda") and not torch.cuda.is_available():
raise RuntimeError("CUDA is not available for the configured receiver device.")So these four scenarios require an actual NVIDIA GPU — but this PR gates them only on This also makes the coverage-table entry misleading: these four belong with the same "needs NVIDIA CUDA" blocker as Only Separately: this PR overlaps #2267, which closes the same issue (#2264) with the identical Generated by Claude Code |
heyong4725
left a comment
There was a problem hiding this comment.
Review summary
Finding 2 (flaky assertion) is correct and clean. Finding 1's intent is right, but as wired the new negative-lifecycle tests can't run on the environment the gate targets — see below.
🔴 Critical — 4 of the 5 "covered" scenarios require CUDA, not just torch
All six new tests are gated on torch + tqdm (the #[ignore] reason text and smoke-all.sh's python3 -c "import torch, tqdm"). But four of the wired scenarios — auto_cleanup.yml, duplicate_free.yml, read_after_free.yml, write_after_free.yml — declare receiver_device: cuda, and receiver.py enforces:
if RECEIVER_DEVICE.startswith("cuda") and not torch.cuda.is_available():
raise RuntimeError("CUDA is not available for the configured receiver device.")The example README confirms it: "The CUDA receiver scenarios require a working CUDA runtime."
Reproduced on a torch-2.5.1 / cuda.is_available() == False machine (exactly the gate's target): the guard raises RuntimeError before the event loop. Both run_smoke_test_local and smoke-all.sh's run_local fail on a non-zero dora run exit, so all four negative-lifecycle tests fail on any torch-but-no-CUDA machine — the precise machines the import torch gate is meant to enable.
Why it matters: Finding 1 of #2264 is "zero CI coverage for the unsafe transport paths." Those unsafe paths are exactly the free/read/write/cleanup lifecycle calls — the four CUDA scenarios. As wired they can't run on CPU CI, so the unsafe-path coverage stays effectively zero; only cpu2cpu (the safe throughput path) actually runs.
Recommended fix (actually achieves the goal): switch the four negative YAMLs to receiver_device: cpu. The pool lifecycle / "warn-don't-crash" contract lives in the daemon's device-independent pool bookkeeping, so CPU exercises it fully and the torch + tqdm gate becomes correct.
Alternative: keep the YAMLs but gate those four tests on torch.cuda.is_available() (not import torch) and fix the docs below — but then they remain un-run on CPU CI, so Finding 1 is not addressed in practice; say so rather than marking the example "covered."
🟡 Minor — coverage table overstates coverage
The new example-smoke.rs table entry implies only cuda2cpu/cpu2cuda need a GPU, but four of the five it lists as covered also need CUDA. Whichever fix is chosen, correct this comment and the #[ignore = "requires torch and tqdm"] reasons on the four CUDA tests to state the real requirement.
✅ Finding 2 — correct
Deterministic counter (random_data[0] = i → assert torch_tensor[0] == i) is a strictly stronger, collision-free replacement for the sum-of-8 checksum. Traced i==0 (register), i>0 (in-place write propagation), and the write_after_free skip — all correct.
Verdict
Finding 2 ready to merge. Finding 1 needs the four negative scenarios runnable on CPU (or honestly re-labelled) before they provide the CI coverage the PR claims.
|
|
||
| echo "" | ||
| echo "=== Memory-pool CPU transport (requires torch) ===" | ||
| if python3 -c "import torch, tqdm" 2>/dev/null; then |
There was a problem hiding this comment.
This import torch, tqdm gate is insufficient for the four negative scenarios below: auto_cleanup.yml, duplicate_free.yml, read_after_free.yml, and write_after_free.yml all set receiver_device: cuda, and receiver.py raises RuntimeError when torch.cuda.is_available() is false. On a torch-but-no-CUDA machine (this gate's target) those four run_local calls crash the receiver and fail the smoke run.
Recommended: switch the four negative YAMLs to receiver_device: cpu (the pool lifecycle contract is device-independent), which makes this gate correct. Otherwise gate those four on CUDA separately — but then they don't run here, leaving #2264 Finding 1 unaddressed on CPU CI.
| // | | duplicate_free, read_after_free, write_after_free} | | | ||
| // | | (#[ignore], run when torch+tqdm available); | | | ||
| // | | smoke-all.sh gates on `import torch`. | | | ||
| // | | cuda2cpu/cpu2cuda/etc blocked: needs NVIDIA CUDA. | | |
There was a problem hiding this comment.
This implies only cuda2cpu/cpu2cuda need a GPU, but auto_cleanup, duplicate_free, read_after_free, and write_after_free also set receiver_device: cuda and require CUDA. So four of the five scenarios listed as "covered" above cannot run on CPU CI as wired. Please correct this note (and the #[ignore = "requires torch and tqdm"] reasons on those four tests) to reflect the real requirement, or flip the YAMLs to receiver_device: cpu so the "covered" claim holds.
|
@tang-canran thanks for the PR — left a review (changes requested). Finding 2 (the deterministic counter fix) looks great and is good to merge as-is. One blocker on Finding 1: four of the five scenarios marked "covered" ( |
The four negative-lifecycle scenarios (duplicate_free, read_after_free, write_after_free, auto_cleanup) previously hardcoded receiver_device: cuda, which caused RuntimeError on any torch-but-no-CUDA machine — exactly the environment the import-torch gate targets. The pool lifecycle contract (register/write/read/free/cleanup) lives in the daemon's device-independent bookkeeping, and the receiver.py assertion is already device-agnostic. CPU path exercises the same unsafe shmem/seqlock/free paths that the GPU path does. Switching to receiver_device: cpu makes the torch-only gate correct and allows these four scenarios to run on CPU CI, actually achieving the goal of Finding 1 from dora-rs#2264. Refs: dora-rs#2264, dora-rs#2268 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
@heyong4725 Fixed — switched the four negative-lifecycle YAMLs to receiver_device: cpu. All 6 smoke tests pass on the updated branch:
|
|
Automated review by Claude — this is a fully automated review; no human has vetted it. Treat accordingly. I re-reviewed after the latest commit ( No new issues found in the latest diff:
The two outstanding inline review threads (on One process note unchanged from before: this overlaps with #2267 and #2279, all closing #2264 — only one should land. Generated by Claude Code |
|
Automated review by Claude — this is a fully automated review with no human in the loop; please verify before acting on it. The I do see one problem with the smoke-test wiring, which is the part this PR exists to add: The torch gate checks the wrong interpreter, so the new coverage won't actually exercise the path. In if python3 -c "import torch, tqdm" 2>/dev/null; then
run_networked "memory-pool-cpu2cpu" ...But
The Suggestion: either add a Minor, non-blocking: the four negative-lifecycle YAMLs were flipped to (Note: #2267 and #2279 also target #2264 with overlapping changes to these same files — only one can land without conflicts.) Generated by Claude Code |
Add to cpu2cpu.yml and the four negative-lifecycle YAMLs, so provisions torch into per-node managed environments. This fixes the gate gap reported by phil-opp: smoke-all.sh activates a smoke-venv that lacks torch, so the gate always evaluated false. Also update README.md to document cpu2cpu.yml and note that CPU scenarios no longer require CUDA. Refs: dora-rs#2264, dora-rs#2268 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
🤖 Automated review by Claude (Claude Code). This is a fully automated review with no human in the loop — please treat it as a suggestion. I looked at the new commit
The counter-based assertion change (sender stamps Generated by Claude Code |
1. Remove the smoke-all.sh torch gate — the host-side check always evaluates false because smoke-venv
(activated by earlier --uv Python examples) lacks torch. Rely on
per-node steps instead; dora --uv provisions torch into
each node's managed env.
2. Replace with in all five YAML
steps. replaces PyPI entirely, so numpy/
tqdm are then resolved only from download.pytorch.org/whl/cpu —
which doesn't reliably host them. adds the
CPU torch index while keeping PyPI as the primary source.
Refs: dora-rs#2264, dora-rs#2268
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Automated review by Claude — this is a fully automated review with no human in the loop. Treat it as a suggestion, not an approval. Re-reviewed after commit
The counter-based assertion fix and the CUDA→CPU flip of the negative-lifecycle YAMLs remain correct. No new issues found in the latest diff. (Process note, unchanged from earlier: this still overlaps #2267 / #2279, which also close #2264 — only one of the three can land without conflicts.) Generated by Claude Code |
|
@heyong4725 All feedback from both you and phil-opp has been addressed across the last few commits. phil-opp's latest automated review (commit 7414260) confirmed no new issues. |
Addresses automated review findings from issue #2264 on merged PR #2168.
Finding 1 — zero CI coverage for unsafe transport paths:
Finding 2 — probabilistically-flaky assertion in receiver.py:
Closes #2264