[BUG]: STF multi-GPU host_launch write intermittently triggers cudaErrorLaunchFailure with registered host memory #7490

@19970126ljl

Description

Is this a duplicate?

Type of Bug

Runtime Error

Component

CUDA Experimental (cudax)

Describe the bug

When running STF on multiple GPUs with per-device partitioned logical_data backed by registered host memory, a host_launch write followed by a GPU task can intermittently trigger an asynchronous cudaErrorLaunchFailure. Single-GPU execution works. Switching the host allocation to cudaMallocHost avoids the crash. The issue is easier to reproduce with a debug build.

Suspected cause (please confirm): a memory-coherency issue when cudaHostRegister-backed memory is written on the CPU and then consumed by a peer GPU via STF. The CPU writes may not be visible to the peer device in time, leading to an illegal access or a launch failure.
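
For reference, a minimal sketch of the two host allocation strategies being compared (buffer name, size, and registration flags are illustrative, not taken from the attached repro):

// Failing variant: pin an existing host buffer with cudaHostRegister,
// then hand slices of it to STF (intermittent cudaErrorLaunchFailure).
std::vector<double> Y(N);
cudaHostRegister(Y.data(), N * sizeof(double), cudaHostRegisterPortable);

// Working variant: allocate pinned host memory directly (no crash observed).
double* Y_pinned = nullptr;
cudaMallocHost(&Y_pinned, N * sizeof(double));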

How to Reproduce

Minimal pattern (simplified):

// The three phases below run for every device index d (loop over devices omitted).
// 1) GPU task on device d
ctx.task(exec_place::device(d), lYs[d].rw())->*[](cudaStream_t s, auto dY) {
  add_one<<<1, 256, 0, s>>>(dY);
};
// 2) host_launch write on that device's data
ctx.host_launch(lYs[d].write())->*[](auto sY) {
  for (size_t i = 0; i < sY.size(); i++) sY(i) = 10.0;
};
// 3) GPU task on device d again
ctx.task(exec_place::device(d), lYs[d].rw())->*[](cudaStream_t s, auto dY) {
  add_one<<<1, 256, 0, s>>>(dY);
};
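
The pattern above assumes per-device logical_data handles created roughly as follows. This is a hypothetical sketch only (the add_one kernel body, the slice layout, and the names N, ndevs, per_dev are illustrative and may differ from the attached repro); it is meant to live inside main() together with the tasks shown above.

#include <cuda/experimental/stf.cuh>
#include <vector>

using namespace cuda::experimental::stf;

// Hypothetical kernel matching the <<<1, 256>>> calls above: 256 threads
// stride over the slice and add 1.0 to each element.
__global__ void add_one(slice<double> dY) {
  for (size_t i = threadIdx.x; i < dY.size(); i += blockDim.x) {
    dY(i) += 1.0;
  }
}

// Hypothetical per-device setup: one logical_data per device, each viewing a
// contiguous chunk of the cudaHostRegister-pinned host buffer Y.
context ctx;
const size_t per_dev = N / ndevs;
std::vector<logical_data<slice<double>>> lYs;
for (int d = 0; d < ndevs; ++d) {
  lYs.push_back(ctx.logical_data(make_slice(Y.data() + d * per_dev, per_dev)));
}
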
  1. Build in debug (more likely to trigger):
    make CCCL_ROOT=/path/to/cccl DEBUG=1 host_launch_mgpu
    make DEBUG=1 host_launch_mgpu
  2. Run multiple times:
    for i in {1..10}; do CUDA_VISIBLE_DEVICES=0,1 ./host_launch_mgpu; done
  3. Program steps:
    • GPU task on each device
    • host_launch write on each device’s data
    • GPU task on each device again
    The final step may intermittently trigger cudaErrorLaunchFailure.

issue_repro.zip

Expected behavior

Multi-GPU execution should complete without launch failure, producing correct results (e.g., Y[i] == 11.0) every run.
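
A possible end-of-run check for that expectation (sketch only; assumes the ctx, Y, and N names from the sketches above, and that finalize writes results back to the registered host buffer):

// After submitting the three task phases for every device:
ctx.finalize();  // waits for all tasks and writes data back to the host buffer

for (size_t i = 0; i < N; ++i) {
  // 10.0 written by host_launch, plus 1.0 from the final GPU task
  if (Y[i] != 11.0) {
    fprintf(stderr, "Mismatch at %zu: %f\n", i, Y[i]);
    return 1;
  }
}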

Reproduction link

No response

Operating System

Ubuntu 22.04.5 LTS

nvidia-smi output

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A800 80GB PCIe          Off |   00000000:9C:00.0 Off |                    0 |
| N/A   41C    P0             70W / 300W  |  17171MiB / 81920MiB   |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A800 80GB PCIe          Off |   00000000:9D:00.0 Off |                    0 |
| N/A   44C    P0             51W / 300W  |  24216MiB / 81920MiB   |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

NVCC version

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Apr__9_19:24:57_PDT_2025
Cuda compilation tools, release 12.9, V12.9.41
Build cuda_12.9.r12.9/compiler.35813241_0
