[BUG]: STF multi-GPU host_launch write intermittently triggers cudaErrorLaunchFailure with registered host memory #7490

@19970126ljl

Description

Is this a duplicate?

Type of Bug

Runtime Error

Component

CUDA Experimental (cudax)

Describe the bug

When running STF on multiple GPUs with per-device partitioned logical_data backed by registered host memory, a host_launch write followed by a GPU task can intermittently trigger an asynchronous cudaErrorLaunchFailure. Single-GPU execution works. Switching the host allocation to cudaMallocHost avoids the crash. The issue is easier to reproduce with a debug build.

Suspected cause (please confirm): a memory-coherency issue when cudaHostRegister-backed memory is written on the CPU and then consumed by a peer GPU via STF. The CPU writes may not be visible to the peer device in time, leading to an illegal access or a launch failure.
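
For reference, a minimal sketch of the two host allocation strategies being compared (buffer name, size, and registration flags are illustrative, not taken from the attached repro):

// Failing variant: pin an existing host buffer with cudaHostRegister,
// then hand slices of it to STF (intermittent cudaErrorLaunchFailure).
std::vector<double> Y(N);
cudaHostRegister(Y.data(), N * sizeof(double), cudaHostRegisterPortable);

// Working variant: allocate pinned host memory directly (no crash observed).
double* Y_pinned = nullptr;
cudaMallocHost(&Y_pinned, N * sizeof(double));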

How to Reproduce

Minimal pattern (simplified):

// The three phases below run for every device index d (loop over devices omitted).
// 1) GPU task on device d
ctx.task(exec_place::device(d), lYs[d].rw())->*[](cudaStream_t s, auto dY) {
  add_one<<<1, 256, 0, s>>>(dY);
};
// 2) host_launch write on that device's data
ctx.host_launch(lYs[d].write())->*[](auto sY) {
  for (size_t i = 0; i < sY.size(); i++) sY(i) = 10.0;
};
// 3) GPU task on device d again
ctx.task(exec_place::device(d), lYs[d].rw())->*[](cudaStream_t s, auto dY) {
  add_one<<<1, 256, 0, s>>>(dY);
};
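
The pattern above assumes per-device logical_data handles created roughly as follows. This is a hypothetical sketch only (the add_one kernel body, the slice layout, and the names N, ndevs, per_dev are illustrative and may differ from the attached repro); it is meant to live inside main() together with the tasks shown above.

#include <cuda/experimental/stf.cuh>
#include <vector>

using namespace cuda::experimental::stf;

// Hypothetical kernel matching the <<<1, 256>>> calls above: 256 threads
// stride over the slice and add 1.0 to each element.
__global__ void add_one(slice<double> dY) {
  for (size_t i = threadIdx.x; i < dY.size(); i += blockDim.x) {
    dY(i) += 1.0;
  }
}

// Hypothetical per-device setup: one logical_data per device, each viewing a
// contiguous chunk of the cudaHostRegister-pinned host buffer Y.
context ctx;
const size_t per_dev = N / ndevs;
std::vector<logical_data<slice<double>>> lYs;
for (int d = 0; d < ndevs; ++d) {
  lYs.push_back(ctx.logical_data(make_slice(Y.data() + d * per_dev, per_dev)));
}
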
  1. Build in debug (more likely to trigger):
    make CCCL_ROOT=/path/to/cccl DEBUG=1 host_launch_mgpu
    make DEBUG=1 host_launch_mgpu
  2. Run multiple times:
    for i in {1..10}; do CUDA_VISIBLE_DEVICES=0,1 ./host_launch_mgpu; done
  3. Program steps:
    • GPU task on each device
    • host_launch write on each device’s data
    • GPU task on each device again
    The final step may intermittently trigger cudaErrorLaunchFailure.

issue_repro.zip

Expected behavior

Multi-GPU execution should complete without launch failure, producing correct results (e.g., Y[i] == 11.0) every run.
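
A possible end-of-run check for that expectation (sketch only; assumes the ctx, Y, and N names from the sketches above, and that finalize writes results back to the registered host buffer):

// After submitting the three task phases for every device:
ctx.finalize();  // waits for all tasks and writes data back to the host buffer

for (size_t i = 0; i < N; ++i) {
  // 10.0 written by host_launch, plus 1.0 from the final GPU task
  if (Y[i] != 11.0) {
    fprintf(stderr, "Mismatch at %zu: %f\n", i, Y[i]);
    return 1;
  }
}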

Reproduction link

No response

Operating System

Ubuntu 22.04.5 LTS

nvidia-smi output

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A800 80GB PCIe          Off |   00000000:9C:00.0 Off |                    0 |
| N/A   41C    P0             70W / 300W  |  17171MiB / 81920MiB   |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A800 80GB PCIe          Off |   00000000:9D:00.0 Off |                    0 |
| N/A   44C    P0             51W / 300W  |  24216MiB / 81920MiB   |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

NVCC version

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Apr__9_19:24:57_PDT_2025
Cuda compilation tools, release 12.9, V12.9.41
Build cuda_12.9.r12.9/compiler.35813241_0
