
[Issue]: GPU Core Dump in Multi-GPU Parallel GEMM Tuning (hipBLASLt 1.1.0, 8x MI308X) #1913

@archwine

Description

Problem Description

GPU core dump occurs intermittently during multi-GPU parallel GEMM tuning with hipBLASLt. The issue is non-deterministic: after a crash, rerunning can successfully tune the previously failed shape, but then crashes on a different shape.

Key Finding: --mp 8 (8 GPUs) crashes consistently, --mp 1 (single GPU) works perfectly.

Error Pattern: Memory access fault on different GPU nodes (node-2, node-3) at different shapes (M=31, 672, 1376, 1440, 1568, 6144) and progress points (7%-79%).

Environment

  • GPU: 8x AMD Instinct MI308X (gfx942)
  • ROCm: 6.4.3-127 (HIP: 6.4.43484-123eb5128)
  • Python: 3.10.12
  • PyTorch: 2.8.0+git245bf6e (ROCm build)
  • hipBLAS: 2.4.0.60403-127
  • hipBLASLt: 1.1.0-aed1757c~dirty
  • rocBLAS: 4.4.1.60403-127
  • OS: Linux 5.10.134-16.3.al8.x86_64
  • NUMA Balancing: Enabled (⚠️ known issue)

Crash Evidence (6 documented runs)

Run  Progress  Failed Shape            GPU Node  Core Dump        Fault Address
1    28%       M=31,   N=128, K=4096   node-3    gpucore.213735   0x7f0300000000
2    51%       M=672,  N=128, K=4096   node-2    gpucore.378428   0x7fbe00014000
3    42%       M=1376, N=128, K=4096   node-2    gpucore.554058   0x7f0800018000
4    7%        M=1440, N=128, K=4096   node-2    gpucore.560948   0x7f5200014000
5    14%       M=1568, N=128, K=4096   node-2    gpucore.575199   0x7fa700000000
6    79%       M=6144, N=128, K=4096   node-2    gpucore.625995   0x7f760000c000

Observations:

  • Different M values (31 to 6144) - no pattern
  • Different progress points (7% to 79%) - can happen anytime
  • Mostly node-2 (5/6), once node-3
  • All addresses in 0x7f... range (host memory region)
  • Occurs during hipBLASLt fast mode testing (1706 solutions)

Error Message

Memory access fault by GPU node-2 (Agent handle: 0x55c77d07fd80) on address 0x7f0800018000. Reason: Unknown.
GPU core dump created: gpucore.554058

Analysis

The failure pattern points to a race condition or resource conflict in multi-GPU parallel execution rather than a shape-specific bug: the failures are non-deterministic (different shapes fail on different runs), while single-GPU mode completes the identical workload reliably. Together these strongly suggest a concurrency issue in hipBLASLt.
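
One way to test that hypothesis is to pin each worker process to a single GPU before the HIP runtime initializes, so no process can touch another worker's device. The following is a hypothetical sketch, not the actual mp_tuner.py code; the worker function and shape list are illustrative:

# Hypothetical isolation test (not the actual aiter/mp_tuner.py code).
# Each worker sees exactly one GPU; if crashes stop under this scheme,
# a cross-process resource conflict in the parallel path is more likely.
import multiprocessing as mp
import os

def worker(gpu_id, shapes):
    # Must be set before the HIP runtime initializes in this process.
    os.environ["HIP_VISIBLE_DEVICES"] = str(gpu_id)
    import torch  # deferred import so the visibility restriction takes effect

    device = torch.device("cuda:0")  # the only visible device
    for m, n, k in shapes:
        a = torch.randn(m, k, dtype=torch.bfloat16, device=device)
        b = torch.randn(k, n, dtype=torch.bfloat16, device=device)
        torch.matmul(a, b)  # stand-in for the hipBLASLt solution sweep
    torch.cuda.synchronize()

if __name__ == "__main__":
    mp.set_start_method("spawn")  # never fork an initialized HIP runtime
    shapes = [(31, 128, 4096), (672, 128, 4096), (1376, 128, 4096)]
    procs = [mp.Process(target=worker, args=(i, shapes)) for i in range(8)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()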

Operating System

Linux 5.10.134-16.3.al8.x86_64

CPU

INTEL(R) XEON(R) PLATINUM 8575C

GPU

8x AMD Instinct MI308X

ROCm Version

6.4.3-127

ROCm Component

No response

Steps to Reproduce

Prerequisites

  • 8x AMD Instinct MI308X GPUs
  • ROCm 6.4.3-127 installed
  • Python 3.10+ with PyTorch 2.8.0 (ROCm build)
  • hipBLASLt 1.1.0

Input File

Create router_gemm_bf16_to_fp32_untuned.csv with 107 GEMM shapes:

M,N,K,bias,dtype,outdtype,scaleAB,bpreshuffle
1,128,4096,False,torch.bfloat16,torch.float32,False,False
2,128,4096,False,torch.bfloat16,torch.float32,False,False
...
31,128,4096,False,torch.bfloat16,torch.float32,False,False
...
6144,128,4096,False,torch.bfloat16,torch.float32,False,False
...
10240,128,4096,False,torch.bfloat16,torch.float32,False,False

(Full file: M=1-32, 40, 48, 56, 64, 96, 128, ..., up to 10240; N=128, K=4096, bf16→fp32)
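
For reference, a minimal generator for such a CSV. Only the M values explicitly listed in this report are included; the intermediate values between 128 and 10240 are elided above and are therefore omitted here as well:

# Sketch: regenerate the untuned-shape CSV. The M list below covers only
# the values spelled out in this report; the remaining shapes of the full
# 107-shape file are elided ("...").
m_values = list(range(1, 33)) + [40, 48, 56, 64, 96, 128, 6144, 10240]

with open("router_gemm_bf16_to_fp32_untuned.csv", "w") as f:
    f.write("M,N,K,bias,dtype,outdtype,scaleAB,bpreshuffle\n")
    for m in m_values:
        f.write(f"{m},128,4096,False,torch.bfloat16,torch.float32,False,False\n")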

Reproduction Command (Triggers Crash)

python gradlib/gradlib/gemm_tuner_parallel.py \
    -i aiter/configs/router_gemm_bf16_to_fp32_untuned.csv \
    -o aiter/configs/router_gemm_bf16_to_fp32_tuned.csv \
    --mp 8

Expected Behavior

All 107 GEMM shapes should be tuned successfully without crashes.

Actual Behavior

  • Process crashes with GPU core dump after tuning 7%-79% of shapes
  • Different shapes fail on different runs (non-deterministic)
  • Error: Memory access fault by GPU node-X on address 0x7f...
  • Core dump files generated (~1.2 GB each)

Workaround (Single-GPU Works)

python gradlib/gradlib/gemm_tuner_parallel.py \
    -i aiter/configs/router_gemm_bf16_to_fp32_untuned.csv \
    -o aiter/configs/router_gemm_bf16_to_fp32_tuned.csv \
    --mp 1

This completes successfully without any crashes, confirming the issue is specific to multi-GPU parallel execution.

Sample Crash Log

[aiter] processed 22 batches of 52, Processing Status ====> 42.0% tuned
M N K bias dtype outdtype 1376 128 4096 False torch.bfloat16 torch.float32 False >>> Total hipb solutions 1706
[aiter] import [module_hipbsolgemm] under .../aiter/jit/module_hipbsolgemm.so (8 workers)
[W126 03:03:30.478958788 collection.cpp:1116] Warning: ROCTracer produced duplicate flow start: 4
Memory access fault by GPU node-2 (Agent handle: 0x55c77d07fd80) on address 0x7f0800018000. Reason: Unknown.
GPU core dump created: gpucore.554058

Reproducibility

  • Multi-GPU (--mp 8): Crashes 100% of the time (6/6 runs tested)
  • Single-GPU (--mp 1): Works 100% of the time (no crashes)
  • Crash occurs at different shapes on each run (non-deterministic)

Attachment: router_gemm_bf16_to_fp32_untuned.csv (the 107-shape input file described above)

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module version 6.7.0 is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                             
System Endianness:       LITTLE                            
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD EPYC 9754 128-Core Processor
  Uuid:                    CPU-XX                            
  Marketing Name:          AMD EPYC 9754 128-Core Processor
  Vendor Name:             CPU                               
  Feature:                 None specified                    
  Profile:                 FULL_PROFILE                      
  Float Round Mode:        NEAR                              
  Max Queue Number:        0(0x0)                            
  Queue Min Size:          0(0x0)                            
  Queue Max Size:          0(0x0)                            
  Queue Type:              MULTI                             
  Node:                    0                                 
  Device Type:             CPU                               
  Cache Info:              
    L1:                      32768(0x8000) KB                  
  Chip ID:                 0(0x0)                            
  ASIC Revision:           0(0x0)                            
  Cacheline Size:          64(0x40)                          
  Max Clock Freq. (MHz):   3100                              
  BDFID:                   0                                 
  Internal Node ID:        0                                 
  Compute Unit:            256                               
  SIMDs per CU:            0                                 
  Shader Engines:          0                                 
  Shader Arrs. per Eng.:   0                                 
  WatchPts on Addr. Ranges:1                                 
  Coherent Host Access:    FALSE                             
  Features:                None
  Fast F16 Operation:      FALSE                             
  Wavefront Size:          0(0x0)                            
  Workgroup Max Size:      0(0x0)                            
  Workgroup Max Size per Dimension:
    x                        0(0x0)                            
    y                        0(0x0)                            
    z                        0(0x0)                            
  Max Waves Per CU:        0(0x0)                            
  Max Work-item Per CU:    0(0x0)                            
  Grid Max Size:           0(0x0)                            
  Grid Max Size per Dimension:
    x                        0(0x0)                            
    y                        0(0x0)                            
    z                        0(0x0)                            
  Max fbarriers/Workgrp:   0                                 
  Packet Processor uCode:: 0                                 
  SDMA engine uCode::      0                                 
  IOMMU Support::          None                              
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED HOST ALLOC
      Size:                    1056561152(0x3ef80000) KB         
      Allocatable:             TRUE                              
      Alloc Granule:           4KB                               
      Alloc Recommended Granule:4KB                              
      Alloc Alignment:         4KB                               
      Accessible by all:       TRUE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED HOST ALLOC
      Size:                    1056561152(0x3ef80000) KB         
      Allocatable:             TRUE                              
      Alloc Granule:           4KB                               
      Alloc Recommended Granule:4KB                              
      Alloc Alignment:         4KB                               
      Accessible by all:       TRUE                              
  ISA Info:                

*******                  
Agent 2                  
*******                  
  Name:                    gfx942                            
  Uuid:                    GPU-a2d5d6b7e1c7e5e8              
  Marketing Name:          AMD Instinct MI308X               
  Vendor Name:             AMD                               
  Feature:                 KERNEL_DISPATCH                   
  Profile:                 BASE_PROFILE                      
  Float Round Mode:        NEAR                              
  Max Queue Number:        128(0x80)                         
  Queue Min Size:          64(0x40)                          
  Queue Max Size:          131072(0x20000)                   
  Queue Type:              MULTI                             
  Node:                    2                                 
  Device Type:             GPU                               
  Cache Info:              
    L1:                      32(0x20) KB                       
    L2:                      32768(0x8000) KB                  
    L3:                      262144(0x40000) KB                
  Chip ID:                 29858(0x74a2)                     
  ASIC Revision:           0(0x0)                            
  Cacheline Size:          64(0x40)                          
  Max Clock Freq. (MHz):   2100                              
  BDFID:                   49152                             
  Internal Node ID:        2                                 
  Compute Unit:            304                               
  SIMDs per CU:            4                                 
  Shader Engines:          8                                 
  Shader Arrs. per Eng.:   1                                 
  WatchPts on Addr. Ranges:4                                 
  Coherent Host Access:    FALSE                             
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                              
  Wavefront Size:          64(0x40)                          
  Workgroup Max Size:      1024(0x400)                       
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                       
    y                        1024(0x400)                       
    z                        1024(0x400)                       
  Max Waves Per CU:        32(0x20)                          
  Max Work-item Per CU:    2048(0x800)                       
  Grid Max Size:           4294967295(0xffffffff)            
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)            
    y                        4294967295(0xffffffff)            
    z                        4294967295(0xffffffff)            
  Max fbarriers/Workgrp:   32                                
  Packet Processor uCode:: 186                               
  SDMA engine uCode::      24                                
  IOMMU Support::          None                              
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED  
      Size:                    201310208(0xbffc000) KB         
      Allocatable:             TRUE                            
      Alloc Granule:           4KB                             
      Alloc Recommended Granule:2048KB                         
      Alloc Alignment:         4KB                             
      Accessible by all:       FALSE                           
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    201310208(0xbffc000) KB         
      Allocatable:             TRUE                            
      Alloc Granule:           4KB                             
      Alloc Recommended Granule:2048KB                         
      Alloc Alignment:         4KB                             
      Accessible by all:       FALSE                           
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED    
      Size:                    201310208(0xbffc000) KB         
      Allocatable:             TRUE                            
      Alloc Granule:           4KB                             
      Alloc Recommended Granule:2048KB                         
      Alloc Alignment:         4KB                             
      Accessible by all:       FALSE                           
    Pool 4                   
      Segment:                 GROUP                          
      Size:                    64(0x40) KB                     
      Allocatable:             FALSE                           
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE         
      Profiles:                HSA_PROFILE_BASE                
      Default Rounding Mode:   NEAR                            
      Fast f16:                TRUE                            
      Workgroup Max Size:      1024(0x400)                     
      Grid Max Size:           4294967295(0xffffffff)          
      FBarrier Max Size:       32                              

*******                  
Agent 3-9 (GPU 1-7)      
*******                  
(Same configuration as Agent 2, all MI308X gfx942)

*** Done ***

Key Information:

  • 8x AMD Instinct MI308X (gfx942)
  • 304 Compute Units per GPU
  • 192 GB memory per GPU (COARSE/FINE/EXTENDED FINE GRAINED pools)
  • ROCm Runtime 1.14, Ext 1.6
  • ISA: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-

Additional Information

Core Dump Files

6 core dump files generated (~1.2 GB each):

  • gpucore.213735, gpucore.378428, gpucore.554058
  • gpucore.560948, gpucore.575199, gpucore.625995

Available upon request (too large for direct upload). Can provide via cloud storage or extract specific information if needed.

System Configuration Issues

NUMA Balancing: Currently ENABLED (value: 1)

$ cat /proc/sys/kernel/numa_balancing
1

⚠️ This is a known issue with ROCm. AMD recommends disabling it:

sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'

Reference: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#disable-numa-auto-balancing
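
To make the change persist across reboots, the same key can be set through the standard sysctl configuration mechanism (generic Linux behavior, not ROCm-specific; the file name below is arbitrary):

echo 'kernel.numa_balancing = 0' | sudo tee /etc/sysctl.d/99-disable-numa-balancing.conf
sudo sysctl --system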

However, even with NUMA balancing enabled, single-GPU mode works fine, suggesting the issue is specifically in hipBLASLt's multi-GPU handling.

ROCTracer Warnings

All crashes are preceded by multiple ROCTracer warnings:

[W126 03:03:30.478958788 collection.cpp:1116] Warning: ROCTracer produced duplicate flow start: 4 (function operator())

These warnings appear consistently across all 8 GPU workers before the memory access fault.

Detailed Crash Logs

Full logs for all 6 crash occurrences available in attached GPU_CORE_DUMP_ISSUE_REPORT.md.

Sample pattern:

(M=1344 completed successfully)
[aiter] processed 22 batches of 52, Processing Status ====> 42.0% tuned
M N K bias dtype outdtype 1376 128 4096 False torch.bfloat16 torch.float32 False >>> Total hipb solutions 1706
[aiter] import [module_hipbsolgemm] (8 workers spawned)
[W126 03:03:30.478958788 collection.cpp:1116] Warning: ROCTracer produced duplicate flow start: 4
[W126 03:03:30.787522382 collection.cpp:1116] Warning: ROCTracer produced duplicate flow start: 4
Memory access fault by GPU node-2 (Agent handle: 0x55c77d07fd80) on address 0x7f0800018000. Reason: Unknown.
GPU core dump created: gpucore.554058

Input File

router_gemm_bf16_to_fp32_untuned.csv - 107 GEMM shapes for router operations:

  • M: 1-32 (consecutive), then 40, 48, 56, 64, 96, 128, ..., up to 10240
  • N: 128 (constant)
  • K: 4096 (constant)
  • dtype: bfloat16 → float32
  • All shapes: no bias, no scaleAB, no preshuffle

Can be provided as attachment or GitHub Gist.

Workarounds Tested

  1. Single-GPU mode (--mp 1): Works perfectly, 100% success rate
  2. ⚠️ Reduced parallelism (--mp 4, --mp 2): Not tested yet, but likely to reduce crash frequency (see the sweep sketch after this list)
  3. Disabling NUMA balancing: Not tested yet (requires root access)
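
A simple sweep over parallelism levels, reusing the reproduction command above, would narrow down the highest stable setting. This is an untested sketch; the per-level output file names are illustrative:

for mp in 2 4 8; do
    python gradlib/gradlib/gemm_tuner_parallel.py \
        -i aiter/configs/router_gemm_bf16_to_fp32_untuned.csv \
        -o aiter/configs/router_gemm_bf16_to_fp32_tuned_mp${mp}.csv \
        --mp ${mp} || echo "--mp ${mp} crashed (exit $?)"
done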

Questions for ROCm Team

  1. Are there known concurrency issues in hipBLASLt 1.1.0 with multi-GPU solution testing?
  2. Should explicit synchronization barriers be added between parallel workers?
  3. How can the GPU core dump files be analyzed to identify the specific hipBLASLt kernel causing the fault? (a rocgdb sketch follows this list)
  4. Is there a recommended maximum parallelism level for hipBLASLt tuning on MI308X?
  5. Could the 0x7f... memory addresses (host memory region) indicate a host-device synchronization issue?
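
Regarding question 3: assuming ROCgdb's GPU core dump support in ROCm 6.x applies to these gpucore files (an assumption, not verified here), a possible starting point would be:

rocgdb -c gpucore.554058 python
(gdb) info threads        # list host threads and GPU waves
(gdb) bt                  # backtrace of the faulting wave/thread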

Repository Information

  • Code: aiter_extensions (commit: d8d8b9023cd3d081120e820d1c62718cad0d15f3)
  • Tuning script: gradlib/gradlib/gemm_tuner_parallel.py
  • GEMM tuner: gradlib/gradlib/GemmTuner.py
  • Multi-process tuner: aiter/utility/mp_tuner.py

Environment Variables

HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

All 8 GPUs are visible and accessible. Single-GPU tests confirm each GPU works individually.
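
The per-device checks mentioned above can be scripted by restricting visibility to one GPU at a time. A sketch using the existing single-GPU workaround; the output paths are illustrative:

for gpu in 0 1 2 3 4 5 6 7; do
    HIP_VISIBLE_DEVICES=${gpu} python gradlib/gradlib/gemm_tuner_parallel.py \
        -i aiter/configs/router_gemm_bf16_to_fp32_untuned.csv \
        -o /tmp/router_gemm_tuned_gpu${gpu}.csv \
        --mp 1
done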
