[opt] Incorrect global ID (gid) computation via hipExtModuleLaunchKernel for non-divisible global sizes when using -fno-offload-uniform-block

**Environment**
- Backend: AMDGPU
- Compiler: Clang / LLVM 17
- Runtime: HIP (ROCm)
- Kernel launch API: hipExtModuleLaunchKernel
- Global size not divisible by block size (tail workgroup exists)
- Add the -fno-offload-uniform-block option during compilation.

**Problem Description**
When launching a kernel via hipExtModuleLaunchKernel with:
`totalThreads % blockSize != 0`
the kernel exhibits incorrect global id (gid) computation:
- Some lower gid values are executed twice
- This only happens when a tail workgroup is present(i.e., global size is not divisible by block size)
- When the global size is divisible by block size, the problem disappears.

**demo**
- Reproducing the problem
`hipcc --offload-device-only kernel.hip -o kernel.co -fno-offload-uniform-block #-mcode-object-version=5`
`hipcc main.cpp -o main_test`
`./main_test`


- kernel.hip
```
#include <hip/hip_runtime.h>
#include <stdint.h>

extern "C" __global__
void write_gid_kernel(uint64_t* out)
{
    uint64_t gid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("gid = %lu\n", gid);
    out[gid] = gid;
}
```


- main.cpp

```
#include <hip/hip_runtime.h>
#include <hip/hip_ext.h>
#include <cstdio>
#include <cstdlib>
#include <vector>


int main()
{
    hipModule_t module;
    hipFunction_t kernel;

    hipModuleLoad(&module, "kernel.co");
    hipModuleGetFunction(&kernel, module, "write_gid_kernel");

    // const int block_size    = 256;
    // const int total_threads = 1000;

    const int block_size    = 512;
    const int total_threads = 822528; // 822528 % 512 = 256 

    printf("logical threads = %d\n", total_threads);
    printf("block size      = %d\n", block_size);

    uint64_t* d_out;
    hipMalloc(&d_out, total_threads * sizeof(uint64_t));

    void* args[] = {
        (void*)&d_out
    };

    
    hipExtModuleLaunchKernel(
        kernel,
        total_threads, 1, 1,   
        block_size, 1, 1, 
        0,
        nullptr,
        args,
        nullptr
    );

    hipDeviceSynchronize();

    std::vector<uint64_t> h_out(total_threads, 0);
    hipMemcpy(
        h_out.data(), d_out,
        total_threads * sizeof(uint64_t),
        hipMemcpyDeviceToHost);

    hipFree(d_out);
    hipModuleUnload(module);
    return 0;
}
```


**Observed Behavior**
When running the above demo, 
- gid in the range:[411136, 411391] is printed twice
- gid in the final tail region: [822272, totalThreads) is not printed at all
- 822528/512=1606; 822528 % 512 = 256; 1606*256=411136; 411136+256-1=411391

- It is suspected that, for the final remainder workgroup(256 threads), gid computation uses 256 * blockIdx + lane_id, leading to incorrect global IDs. 

**Expected Behavior**
Each gid in [0, totalThreads) should be printed exactly once




Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[opt] Incorrect global ID (gid) computation via hipExtModuleLaunchKernel for non-divisible global sizes when using -fno-offload-uniform-block #173588

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[opt] Incorrect global ID (gid) computation via hipExtModuleLaunchKernel for non-divisible global sizes when using -fno-offload-uniform-block #173588

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions