Skip to content

[opt] Incorrect global ID (gid) computation via hipExtModuleLaunchKernel for non-divisible global sizes when using -fno-offload-uniform-block #173588

@0oyyo0

Description

@0oyyo0

Environment

  • Backend: AMDGPU
  • Compiler: Clang / LLVM 17
  • Runtime: HIP (ROCm)
  • Kernel launch API: hipExtModuleLaunchKernel
  • Global size not divisible by block size (tail workgroup exists)
  • Add the -fno-offload-uniform-block option during compilation.

Problem Description
When launching a kernel via hipExtModuleLaunchKernel with:
totalThreads % blockSize != 0
the kernel exhibits incorrect global id (gid) computation:

  • Some lower gid values are executed twice
  • This only happens when a tail workgroup is present(i.e., global size is not divisible by block size)
  • When the global size is divisible by block size, the problem disappears.

demo

  • Reproducing the problem
    hipcc --offload-device-only kernel.hip -o kernel.co -fno-offload-uniform-block #-mcode-object-version=5
    hipcc main.cpp -o main_test
    ./main_test

  • kernel.hip

#include <hip/hip_runtime.h>
#include <stdint.h>

extern "C" __global__
void write_gid_kernel(uint64_t* out)
{
    uint64_t gid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("gid = %lu\n", gid);
    out[gid] = gid;
}
  • main.cpp
#include <hip/hip_runtime.h>
#include <hip/hip_ext.h>
#include <cstdio>
#include <cstdlib>
#include <vector>


int main()
{
    hipModule_t module;
    hipFunction_t kernel;

    hipModuleLoad(&module, "kernel.co");
    hipModuleGetFunction(&kernel, module, "write_gid_kernel");

    // const int block_size    = 256;
    // const int total_threads = 1000;

    const int block_size    = 512;
    const int total_threads = 822528; // 822528 % 512 = 256 

    printf("logical threads = %d\n", total_threads);
    printf("block size      = %d\n", block_size);

    uint64_t* d_out;
    hipMalloc(&d_out, total_threads * sizeof(uint64_t));

    void* args[] = {
        (void*)&d_out
    };

    
    hipExtModuleLaunchKernel(
        kernel,
        total_threads, 1, 1,   
        block_size, 1, 1, 
        0,
        nullptr,
        args,
        nullptr
    );

    hipDeviceSynchronize();

    std::vector<uint64_t> h_out(total_threads, 0);
    hipMemcpy(
        h_out.data(), d_out,
        total_threads * sizeof(uint64_t),
        hipMemcpyDeviceToHost);

    hipFree(d_out);
    hipModuleUnload(module);
    return 0;
}

Observed Behavior
When running the above demo,

  • gid in the range:[411136, 411391] is printed twice

  • gid in the final tail region: [822272, totalThreads) is not printed at all

  • 822528/512=1606; 822528 % 512 = 256; 1606*256=411136; 411136+256-1=411391

  • It is suspected that, for the final remainder workgroup(256 threads), gid computation uses 256 * blockIdx + lane_id, leading to incorrect global IDs.

Expected Behavior
Each gid in [0, totalThreads) should be printed exactly once

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions