Is this a duplicate?
- I confirmed there appear to be no duplicate issues for this bug and that I agree to the Code of Conduct
Type of Bug
Runtime Error
Component
CUDA Experimental (cudax)
Describe the bug
When running STF on multiple GPUs with per-device partitioned logical_data backed by registered host memory, a host_launch write followed by a GPU task can intermittently trigger an asynchronous cudaErrorLaunchFailure. Single-GPU execution works. Switching the host allocation to cudaMallocHost avoids the crash. The issue is easier to reproduce with a debug build.
Suspected cause (please confirm): a memory-coherency issue when cudaHostRegister-backed memory is written on the CPU and then consumed by a peer GPU via STF; the CPU writes may not be visible to the peer device in time, leading to an illegal access or launch failure.
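To make the difference concrete, here is a minimal sketch of the two host-allocation paths (the pointer names, `N`, and the use of `make_slice` to wrap the host pointer are illustrative assumptions; the actual reproducer may construct the logical data differently):
```cpp
// Failing configuration (suspected): ordinary host memory, pinned afterwards with cudaHostRegister
double* p = static_cast<double*>(malloc(N * sizeof(double)));
cudaHostRegister(p, N * sizeof(double), cudaHostRegisterDefault);
auto lY_registered = ctx.logical_data(make_slice(p, N));

// Workaround that avoids the crash: allocate pinned memory with cudaMallocHost instead
double* q = nullptr;
cudaMallocHost(&q, N * sizeof(double));
auto lY_malloc_host = ctx.logical_data(make_slice(q, N));
```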
How to Reproduce
Minimal pattern (simplified; a fuller self-contained sketch follows the steps below):
```cpp
// 1) GPU task on each device
ctx.task(exec_place::device(d), lYs[d].rw())->*[](cudaStream_t s, auto dY) {
  add_one<<<1, 256, 0, s>>>(dY);
};
// 2) host_launch write on each device's data
ctx.host_launch(lYs[d].write())->*[](auto sY) {
  for (size_t i = 0; i < sY.size(); i++) sY(i) = 10.0;
};
// 3) GPU task again on each device
ctx.task(exec_place::device(d), lYs[d].rw())->*[](cudaStream_t s, auto dY) {
  add_one<<<1, 256, 0, s>>>(dY);
};
```
- Build in debug (more likely to trigger):
```sh
make CCCL_ROOT=/path/to/cccl DEBUG=1 host_launch_mgpu
make DEBUG=1 host_launch_mgpu
```
- Run multiple times:
```sh
for i in {1..10}; do CUDA_VISIBLE_DEVICES=0,1 ./host_launch_mgpu; done
```
- Program steps:
  - GPU task on each device
  - host_launch write on each device's data
  - GPU task on each device again

The final step may intermittently trigger cudaErrorLaunchFailure.
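For completeness, below is a self-contained sketch of the reproducer following the pattern above. It is a sketch, not the exact host_launch_mgpu source: the kernel body, the buffer size `N`, the use of `make_slice` to wrap the cudaHostRegister-backed pointers, and the verification/cleanup at the end are assumptions.
```cpp
#include <cuda/experimental/stf.cuh>

#include <cstdio>
#include <cstdlib>
#include <vector>

using namespace cuda::experimental::stf;

// Simple kernel: add 1.0 to every element of the slice (launched with a single block)
__global__ void add_one(slice<double> Y)
{
  for (size_t i = threadIdx.x; i < Y.size(); i += blockDim.x)
  {
    Y(i) += 1.0;
  }
}

int main()
{
  context ctx;

  int ndevs = 0;
  cudaGetDeviceCount(&ndevs);

  const size_t N = 1 << 20; // illustrative size

  // One host buffer per device, pinned after the fact with cudaHostRegister
  // (the configuration that intermittently fails; cudaMallocHost avoids it).
  std::vector<double*> Y(ndevs);
  std::vector<logical_data<slice<double>>> lYs;
  for (int d = 0; d < ndevs; ++d)
  {
    Y[d] = static_cast<double*>(malloc(N * sizeof(double)));
    cudaHostRegister(Y[d], N * sizeof(double), cudaHostRegisterDefault);
    lYs.push_back(ctx.logical_data(make_slice(Y[d], N)));
  }

  // 1) GPU task on each device
  for (int d = 0; d < ndevs; ++d)
  {
    ctx.task(exec_place::device(d), lYs[d].rw())->*[](cudaStream_t s, auto dY) {
      add_one<<<1, 256, 0, s>>>(dY);
    };
  }

  // 2) host_launch write on each device's data
  for (int d = 0; d < ndevs; ++d)
  {
    ctx.host_launch(lYs[d].write())->*[](auto sY) {
      for (size_t i = 0; i < sY.size(); ++i)
      {
        sY(i) = 10.0;
      }
    };
  }

  // 3) GPU task again on each device: this step intermittently fails with
  //    cudaErrorLaunchFailure in the multi-GPU case
  for (int d = 0; d < ndevs; ++d)
  {
    ctx.task(exec_place::device(d), lYs[d].rw())->*[](cudaStream_t s, auto dY) {
      add_one<<<1, 256, 0, s>>>(dY);
    };
  }

  ctx.finalize();

  // Expected result after write-back: every element equals 11.0
  for (int d = 0; d < ndevs; ++d)
  {
    if (Y[d][0] != 11.0)
    {
      printf("device %d: unexpected value %f (expected 11.0)\n", d, Y[d][0]);
    }
    cudaHostUnregister(Y[d]);
    free(Y[d]);
  }

  return 0;
}
```
Replacing the malloc + cudaHostRegister pair with cudaMallocHost in this sketch corresponds to the workaround mentioned above.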
Expected behavior
Multi-GPU execution should complete without launch failure, producing correct results (e.g., Y[i] == 11.0) every run.
Reproduction link
No response
Operating System
Ubuntu 22.04.5 LTS
nvidia-smi output
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03 Driver Version: 570.195.03 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A800 80GB PCIe Off | 00000000:9C:00.0 Off | 0 |
| N/A 41C P0 70W / 300W | 17171MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A800 80GB PCIe Off | 00000000:9D:00.0 Off | 0 |
| N/A 44C P0 51W / 300W | 24216MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
```
NVCC version
```
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Apr__9_19:24:57_PDT_2025
Cuda compilation tools, release 12.9, V12.9.41
Build cuda_12.9.r12.9/compiler.35813241_0
```