
[Issue]: GPU Core Dump in Multi-GPU Parallel GEMM Tuning (hipBLASLt 1.1.0, 8x MI308X) #1913

@archwine

Description

Problem Description

GPU core dump occurs intermittently during multi-GPU parallel GEMM tuning with hipBLASLt. The issue is non-deterministic: after a crash, rerunning can successfully tune the previously failed shape, but then crashes on a different shape.

Key Finding: --mp 8 (8 GPUs) crashes consistently, --mp 1 (single GPU) works perfectly.

Error Pattern: Memory access fault on different GPU nodes (node-2, node-3) at different shapes (M=31, 672, 1376, 1440, 1568, 6144) and progress points (7%-79%).

Environment

  • GPU: 8x AMD Instinct MI308X (gfx942)
  • ROCm: 6.4.3-127 (HIP: 6.4.43484-123eb5128)
  • Python: 3.10.12
  • PyTorch: 2.8.0+git245bf6e (ROCm build)
  • hipBLAS: 2.4.0.60403-127
  • hipBLASLt: 1.1.0-aed1757c~dirty
  • rocBLAS: 4.4.1.60403-127
  • OS: Linux 5.10.134-16.3.al8.x86_64
  • NUMA Balancing: Enabled (⚠️ known issue)

Crash Evidence (6 documented runs)

Run  Progress  Failed Shape            GPU Node  Core Dump        Fault Address
1    28%       M=31,   N=128, K=4096   node-3    gpucore.213735   0x7f0300000000
2    51%       M=672,  N=128, K=4096   node-2    gpucore.378428   0x7fbe00014000
3    42%       M=1376, N=128, K=4096   node-2    gpucore.554058   0x7f0800018000
4    7%        M=1440, N=128, K=4096   node-2    gpucore.560948   0x7f5200014000
5    14%       M=1568, N=128, K=4096   node-2    gpucore.575199   0x7fa700000000
6    79%       M=6144, N=128, K=4096   node-2    gpucore.625995   0x7f760000c000

Observations:

  • Different M values (31 to 6144) - no pattern
  • Different progress points (7% to 79%) - can happen anytime
  • Mostly node-2 (5/6), once node-3
  • All addresses in 0x7f... range (host memory region)
  • Occurs during hipBLASLt fast mode testing (1706 solutions)

Error Message

Memory access fault by GPU node-2 (Agent handle: 0x55c77d07fd80) on address 0x7f0800018000. Reason: Unknown.
GPU core dump created: gpucore.554058

Analysis

The failure pattern points to a race condition or resource conflict in multi-GPU parallel execution rather than a shape-specific bug: the failures are non-deterministic (different shapes fail on different runs), while single-GPU mode completes the identical workload reliably. Together these strongly suggest a concurrency issue in hipBLASLt.
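
One way to test that hypothesis is to pin each worker process to a single GPU before the HIP runtime initializes, so no process can touch another worker's device. The following is a hypothetical sketch, not the actual mp_tuner.py code; the worker function and shape list are illustrative:

# Hypothetical isolation test (not the actual aiter/mp_tuner.py code).
# Each worker sees exactly one GPU; if crashes stop under this scheme,
# a cross-process resource conflict in the parallel path is more likely.
import multiprocessing as mp
import os

def worker(gpu_id, shapes):
    # Must be set before the HIP runtime initializes in this process.
    os.environ["HIP_VISIBLE_DEVICES"] = str(gpu_id)
    import torch  # deferred import so the visibility restriction takes effect

    device = torch.device("cuda:0")  # the only visible device
    for m, n, k in shapes:
        a = torch.randn(m, k, dtype=torch.bfloat16, device=device)
        b = torch.randn(k, n, dtype=torch.bfloat16, device=device)
        torch.matmul(a, b)  # stand-in for the hipBLASLt solution sweep
    torch.cuda.synchronize()

if __name__ == "__main__":
    mp.set_start_method("spawn")  # never fork an initialized HIP runtime
    shapes = [(31, 128, 4096), (672, 128, 4096), (1376, 128, 4096)]
    procs = [mp.Process(target=worker, args=(i, shapes)) for i in range(8)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()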

Operating System

Linux 5.10.134-16.3.al8.x86_64

CPU

INTEL(R) XEON(R) PLATINUM 8575C

GPU

8x AMD Instinct MI308X

ROCm Version

6.4.3-127

ROCm Component

No response

Steps to Reproduce

Prerequisites

  • 8x AMD Instinct MI308X GPUs
  • ROCm 6.4.3-127 installed
  • Python 3.10+ with PyTorch 2.8.0 (ROCm build)
  • hipBLASLt 1.1.0

Input File

Create router_gemm_bf16_to_fp32_untuned.csv with 107 GEMM shapes:

M,N,K,bias,dtype,outdtype,scaleAB,bpreshuffle
1,128,4096,False,torch.bfloat16,torch.float32,False,False
2,128,4096,False,torch.bfloat16,torch.float32,False,False
...
31,128,4096,False,torch.bfloat16,torch.float32,False,False
...
6144,128,4096,False,torch.bfloat16,torch.float32,False,False
...
10240,128,4096,False,torch.bfloat16,torch.float32,False,False

(Full file: M=1-32, 40, 48, 56, 64, 96, 128, ..., up to 10240; N=128, K=4096, bf16→fp32)
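
For reference, a minimal generator for such a CSV. Only the M values explicitly listed in this report are included; the intermediate values between 128 and 10240 are elided above and are therefore omitted here as well:

# Sketch: regenerate the untuned-shape CSV. The M list below covers only
# the values spelled out in this report; the remaining shapes of the full
# 107-shape file are elided ("...").
m_values = list(range(1, 33)) + [40, 48, 56, 64, 96, 128, 6144, 10240]

with open("router_gemm_bf16_to_fp32_untuned.csv", "w") as f:
    f.write("M,N,K,bias,dtype,outdtype,scaleAB,bpreshuffle\n")
    for m in m_values:
        f.write(f"{m},128,4096,False,torch.bfloat16,torch.float32,False,False\n")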

Reproduction Command (Triggers Crash)

python gradlib/gradlib/gemm_tuner_parallel.py \
    -i aiter/configs/router_gemm_bf16_to_fp32_untuned.csv \
    -o aiter/configs/router_gemm_bf16_to_fp32_tuned.csv \
    --mp 8

Expected Behavior

All 107 GEMM shapes should be tuned successfully without crashes.

Actual Behavior

  • Process crashes with GPU core dump after tuning 7%-79% of shapes
  • Different shapes fail on different runs (non-deterministic)
  • Error: Memory access fault by GPU node-X on address 0x7f...
  • Core dump files generated (~1.2 GB each)

Workaround (Single-GPU Works)

python gradlib/gradlib/gemm_tuner_parallel.py \
    -i aiter/configs/router_gemm_bf16_to_fp32_untuned.csv \
    -o aiter/configs/router_gemm_bf16_to_fp32_tuned.csv \
    --mp 1

This completes successfully without any crashes, confirming the issue is specific to multi-GPU parallel execution.

Sample Crash Log

[aiter] processed 22 batches of 52, Processing Status ====> 42.0% tuned
M N K bias dtype outdtype 1376 128 4096 False torch.bfloat16 torch.float32 False >>> Total hipb solutions 1706
[aiter] import [module_hipbsolgemm] under .../aiter/jit/module_hipbsolgemm.so (8 workers)
[W126 03:03:30.478958788 collection.cpp:1116] Warning: ROCTracer produced duplicate flow start: 4
Memory access fault by GPU node-2 (Agent handle: 0x55c77d07fd80) on address 0x7f0800018000. Reason: Unknown.
GPU core dump created: gpucore.554058

Reproducibility

  • Multi-GPU (--mp 8): Crashes 100% of the time (6/6 runs tested)
  • Single-GPU (--mp 1): Works 100% of the time (no crashes)
  • Crash occurs at different shapes on each run (non-deterministic)

Attachment: router_gemm_bf16_to_fp32_untuned.csv (the 107-shape input file described above)

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module version 6.7.0 is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                             
System Endianness:       LITTLE                            
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD EPYC 9754 128-Core Processor
  Uuid:                    CPU-XX                            
  Marketing Name:          AMD EPYC 9754 128-Core Processor
  Vendor Name:             CPU                               
  Feature:                 None specified                    
  Profile:                 FULL_PROFILE                      
  Float Round Mode:        NEAR                              
  Max Queue Number:        0(0x0)                            
  Queue Min Size:          0(0x0)                            
  Queue Max Size:          0(0x0)                            
  Queue Type:              MULTI                             
  Node:                    0                                 
  Device Type:             CPU                               
  Cache Info:              
    L1:                      32768(0x8000) KB                  
  Chip ID:                 0(0x0)                            
  ASIC Revision:           0(0x0)                            
  Cacheline Size:          64(0x40)                          
  Max Clock Freq. (MHz):   3100                              
  BDFID:                   0                                 
  Internal Node ID:        0                                 
  Compute Unit:            256                               
  SIMDs per CU:            0                                 
  Shader Engines:          0                                 
  Shader Arrs. per Eng.:   0                                 
  WatchPts on Addr. Ranges:1                                 
  Coherent Host Access:    FALSE                             
  Features:                None
  Fast F16 Operation:      FALSE                             
  Wavefront Size:          0(0x0)                            
  Workgroup Max Size:      0(0x0)                            
  Workgroup Max Size per Dimension:
    x                        0(0x0)                            
    y                        0(0x0)                            
    z                        0(0x0)                            
  Max Waves Per CU:        0(0x0)                            
  Max Work-item Per CU:    0(0x0)                            
  Grid Max Size:           0(0x0)                            
  Grid Max Size per Dimension:
    x                        0(0x0)                            
    y                        0(0x0)                            
    z                        0(0x0)                            
  Max fbarriers/Workgrp:   0                                 
  Packet Processor uCode:: 0                                 
  SDMA engine uCode::      0                                 
  IOMMU Support::          None                              
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED HOST ALLOC
      Size:                    1056561152(0x3ef80000) KB         
      Allocatable:             TRUE                              
      Alloc Granule:           4KB                               
      Alloc Recommended Granule:4KB                              
      Alloc Alignment:         4KB                               
      Accessible by all:       TRUE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED HOST ALLOC
      Size:                    1056561152(0x3ef80000) KB         
      Allocatable:             TRUE                              
      Alloc Granule:           4KB                               
      Alloc Recommended Granule:4KB                              
      Alloc Alignment:         4KB                               
      Accessible by all:       TRUE                              
  ISA Info:                

*******                  
Agent 2                  
*******                  
  Name:                    gfx942                            
  Uuid:                    GPU-a2d5d6b7e1c7e5e8              
  Marketing Name:          AMD Instinct MI308X               
  Vendor Name:             AMD                               
  Feature:                 KERNEL_DISPATCH                   
  Profile:                 BASE_PROFILE                      
  Float Round Mode:        NEAR                              
  Max Queue Number:        128(0x80)                         
  Queue Min Size:          64(0x40)                          
  Queue Max Size:          131072(0x20000)                   
  Queue Type:              MULTI                             
  Node:                    2                                 
  Device Type:             GPU                               
  Cache Info:              
    L1:                      32(0x20) KB                       
    L2:                      32768(0x8000) KB                  
    L3:                      262144(0x40000) KB                
  Chip ID:                 29858(0x74a2)                     
  ASIC Revision:           0(0x0)                            
  Cacheline Size:          64(0x40)                          
  Max Clock Freq. (MHz):   2100                              
  BDFID:                   49152                             
  Internal Node ID:        2                                 
  Compute Unit:            304                               
  SIMDs per CU:            4                                 
  Shader Engines:          8                                 
  Shader Arrs. per Eng.:   1                                 
  WatchPts on Addr. Ranges:4                                 
  Coherent Host Access:    FALSE                             
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                              
  Wavefront Size:          64(0x40)                          
  Workgroup Max Size:      1024(0x400)                       
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                       
    y                        1024(0x400)                       
    z                        1024(0x400)                       
  Max Waves Per CU:        32(0x20)                          
  Max Work-item Per CU:    2048(0x800)                       
  Grid Max Size:           4294967295(0xffffffff)            
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)            
    y                        4294967295(0xffffffff)            
    z                        4294967295(0xffffffff)            
  Max fbarriers/Workgrp:   32                                
  Packet Processor uCode:: 186                               
  SDMA engine uCode::      24                                
  IOMMU Support::          None                              
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED  
      Size:                    201310208(0xbffc000) KB         
      Allocatable:             TRUE                            
      Alloc Granule:           4KB                             
      Alloc Recommended Granule:2048KB                         
      Alloc Alignment:         4KB                             
      Accessible by all:       FALSE                           
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    201310208(0xbffc000) KB         
      Allocatable:             TRUE                            
      Alloc Granule:           4KB                             
      Alloc Recommended Granule:2048KB                         
      Alloc Alignment:         4KB                             
      Accessible by all:       FALSE                           
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED    
      Size:                    201310208(0xbffc000) KB         
      Allocatable:             TRUE                            
      Alloc Granule:           4KB                             
      Alloc Recommended Granule:2048KB                         
      Alloc Alignment:         4KB                             
      Accessible by all:       FALSE                           
    Pool 4                   
      Segment:                 GROUP                          
      Size:                    64(0x40) KB                     
      Allocatable:             FALSE                           
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE         
      Profiles:                HSA_PROFILE_BASE                
      Default Rounding Mode:   NEAR                            
      Fast f16:                TRUE                            
      Workgroup Max Size:      1024(0x400)                     
      Grid Max Size:           4294967295(0xffffffff)          
      FBarrier Max Size:       32                              

*******                  
Agent 3-9 (GPU 1-7)      
*******                  
(Same configuration as Agent 2, all MI308X gfx942)

*** Done ***

Key Information:

  • 8x AMD Instinct MI308X (gfx942)
  • 304 Compute Units per GPU
  • 192 GB memory per GPU (COARSE/FINE/EXTENDED FINE GRAINED pools)
  • ROCm Runtime 1.14, Ext 1.6
  • ISA: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-

Additional Information

Core Dump Files

6 core dump files generated (~1.2 GB each):

  • gpucore.213735, gpucore.378428, gpucore.554058
  • gpucore.560948, gpucore.575199, gpucore.625995

Available upon request (too large for direct upload). Can provide via cloud storage or extract specific information if needed.

System Configuration Issues

NUMA Balancing: Currently ENABLED (value: 1)

$ cat /proc/sys/kernel/numa_balancing
1

⚠️ This is a known issue with ROCm. AMD recommends disabling it:

sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'

Reference: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#disable-numa-auto-balancing
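
To make the change persist across reboots, the same key can be set through the standard sysctl configuration mechanism (generic Linux behavior, not ROCm-specific; the file name below is arbitrary):

echo 'kernel.numa_balancing = 0' | sudo tee /etc/sysctl.d/99-disable-numa-balancing.conf
sudo sysctl --system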

However, even with NUMA balancing enabled, single-GPU mode works fine, suggesting the issue is specifically in hipBLASLt's multi-GPU handling.

ROCTracer Warnings

All crashes are preceded by multiple ROCTracer warnings:

[W126 03:03:30.478958788 collection.cpp:1116] Warning: ROCTracer produced duplicate flow start: 4 (function operator())

These warnings appear consistently across all 8 GPU workers before the memory access fault.

Detailed Crash Logs

Full logs for all 6 crash occurrences available in attached GPU_CORE_DUMP_ISSUE_REPORT.md.

Sample pattern:

(M=1344 completed successfully)
[aiter] processed 22 batches of 52, Processing Status ====> 42.0% tuned
M N K bias dtype outdtype 1376 128 4096 False torch.bfloat16 torch.float32 False >>> Total hipb solutions 1706
[aiter] import [module_hipbsolgemm] (8 workers spawned)
[W126 03:03:30.478958788 collection.cpp:1116] Warning: ROCTracer produced duplicate flow start: 4
[W126 03:03:30.787522382 collection.cpp:1116] Warning: ROCTracer produced duplicate flow start: 4
Memory access fault by GPU node-2 (Agent handle: 0x55c77d07fd80) on address 0x7f0800018000. Reason: Unknown.
GPU core dump created: gpucore.554058

Input File

router_gemm_bf16_to_fp32_untuned.csv - 107 GEMM shapes for router operations:

  • M: 1-32 (consecutive), then 40, 48, 56, 64, 96, 128, ..., up to 10240
  • N: 128 (constant)
  • K: 4096 (constant)
  • dtype: bfloat16 → float32
  • All shapes: no bias, no scaleAB, no preshuffle

Can be provided as attachment or GitHub Gist.

Workarounds Tested

  1. Single-GPU mode (--mp 1): Works perfectly, 100% success rate
  2. ⚠️ Reduced parallelism (--mp 4, --mp 2): Not tested yet, but likely to reduce crash frequency (see the sweep sketch after this list)
  3. Disabling NUMA balancing: Not tested yet (requires root access)
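
A simple sweep over parallelism levels, reusing the reproduction command above, would narrow down the highest stable setting. This is an untested sketch; the per-level output file names are illustrative:

for mp in 2 4 8; do
    python gradlib/gradlib/gemm_tuner_parallel.py \
        -i aiter/configs/router_gemm_bf16_to_fp32_untuned.csv \
        -o aiter/configs/router_gemm_bf16_to_fp32_tuned_mp${mp}.csv \
        --mp ${mp} || echo "--mp ${mp} crashed (exit $?)"
done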

Questions for ROCm Team

  1. Are there known concurrency issues in hipBLASLt 1.1.0 with multi-GPU solution testing?
  2. Should explicit synchronization barriers be added between parallel workers?
  3. How can the GPU core dump files be analyzed to identify the specific hipBLASLt kernel causing the fault? (a rocgdb sketch follows this list)
  4. Is there a recommended maximum parallelism level for hipBLASLt tuning on MI308X?
  5. Could the 0x7f... memory addresses (host memory region) indicate a host-device synchronization issue?
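
Regarding question 3: assuming ROCgdb's GPU core dump support in ROCm 6.x applies to these gpucore files (an assumption, not verified here), a possible starting point would be:

rocgdb -c gpucore.554058 python
(gdb) info threads        # list host threads and GPU waves
(gdb) bt                  # backtrace of the faulting wave/thread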

Repository Information

  • Code: aiter_extensions (commit: d8d8b9023cd3d081120e820d1c62718cad0d15f3)
  • Tuning script: gradlib/gradlib/gemm_tuner_parallel.py
  • GEMM tuner: gradlib/gradlib/GemmTuner.py
  • Multi-process tuner: aiter/utility/mp_tuner.py

Environment Variables

HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

All 8 GPUs are visible and accessible. Single-GPU tests confirm each GPU works individually.
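
The per-device checks mentioned above can be scripted by restricting visibility to one GPU at a time. A sketch using the existing single-GPU workaround; the output paths are illustrative:

for gpu in 0 1 2 3 4 5 6 7; do
    HIP_VISIBLE_DEVICES=${gpu} python gradlib/gradlib/gemm_tuner_parallel.py \
        -i aiter/configs/router_gemm_bf16_to_fp32_untuned.csv \
        -o /tmp/router_gemm_tuned_gpu${gpu}.csv \
        --mp 1
done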
