Problem Description
GPU core dump occurs intermittently during multi-GPU parallel GEMM tuning with hipBLASLt. The issue is non-deterministic: after a crash, rerunning can successfully tune the previously failed shape, but then crashes on a different shape.
Key Finding: --mp 8 (8 GPUs) crashes consistently, --mp 1 (single GPU) works perfectly.
Error Pattern: Memory access fault on different GPU nodes (node-2, node-3) at different shapes (M=31, 672, 1376, 1440, 1568, 6144) and progress points (7%-79%).
Environment
- GPU: 8x AMD Instinct MI308X (gfx942)
- ROCm: 6.4.3-127 (HIP: 6.4.43484-123eb5128)
- Python: 3.10.12
- PyTorch: 2.8.0+git245bf6e (ROCm build)
- hipBLAS: 2.4.0.60403-127
- hipBLASLt: 1.1.0-aed1757c~dirty
- rocBLAS: 4.4.1.60403-127
- OS: Linux 5.10.134-16.3.al8.x86_64
- NUMA Balancing: Enabled (⚠️ known issue)
Crash Evidence (6 documented runs)
| Run | Progress | Failed Shape | GPU Node | Core Dump | Address |
|---|---|---|---|---|---|
| 1 | 28% | M=31, N=128, K=4096 | node-3 | gpucore.213735 | 0x7f0300000000 |
| 2 | 51% | M=672, N=128, K=4096 | node-2 | gpucore.378428 | 0x7fbe00014000 |
| 3 | 42% | M=1376, N=128, K=4096 | node-2 | gpucore.554058 | 0x7f0800018000 |
| 4 | 7% | M=1440, N=128, K=4096 | node-2 | gpucore.560948 | 0x7f5200014000 |
| 5 | 14% | M=1568, N=128, K=4096 | node-2 | gpucore.575199 | 0x7fa700000000 |
| 6 | 79% | M=6144, N=128, K=4096 | node-2 | gpucore.625995 | 0x7f760000c000 |
Observations:
- Different M values (31 to 6144) - no pattern
- Different progress points (7% to 79%) - can happen anytime
- Mostly node-2 (5/6), once node-3
- All addresses in 0x7f... range (host memory region)
- Occurs during hipBLASLt fast mode testing (1706 solutions)
Error Message
Memory access fault by GPU node-2 (Agent handle: 0x55c77d07fd80) on address 0x7f0800018000. Reason: Unknown.
GPU core dump created: gpucore.554058
Analysis
This points to a race condition or resource conflict in multi-GPU parallel execution rather than a shape-specific bug. The non-deterministic failures (different shapes fail on different runs) and the fact that single-GPU mode works reliably are both consistent with a concurrency issue in hipBLASLt.
Operating System
Linux 5.10.134-16.3.al8.x86_64
CPU
INTEL(R) XEON(R) PLATINUM 8575C
GPU
8x AMD Instinct MI308X
ROCm Version
6.4.3-127
ROCm Component
No response
Steps to Reproduce
Prerequisites
- 8x AMD Instinct MI308X GPUs
- ROCm 6.4.3-127 installed
- Python 3.10+ with PyTorch 2.8.0 (ROCm build)
- hipBLASLt 1.1.0
Input File
Create router_gemm_bf16_to_fp32_untuned.csv with 107 GEMM shapes:
M,N,K,bias,dtype,outdtype,scaleAB,bpreshuffle
1,128,4096,False,torch.bfloat16,torch.float32,False,False
2,128,4096,False,torch.bfloat16,torch.float32,False,False
...
31,128,4096,False,torch.bfloat16,torch.float32,False,False
...
6144,128,4096,False,torch.bfloat16,torch.float32,False,False
...
10240,128,4096,False,torch.bfloat16,torch.float32,False,False

(Full file: M=1-32, 40, 48, 56, 64, 96, 128, ..., up to 10240; N=128, K=4096, bf16→fp32)
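If it helps with triage, a minimal shell sketch that regenerates the documented rows is below. The full list of 107 M values is elided above with "...", so only the values actually shown are emitted; the rest must come from the real file.

```bash
# Sketch only: rebuild the header plus the M values documented above.
# The elided M values ("...") are intentionally NOT guessed here.
{
  echo "M,N,K,bias,dtype,outdtype,scaleAB,bpreshuffle"
  for M in $(seq 1 32) 40 48 56 64 96 128 6144 10240; do
    echo "${M},128,4096,False,torch.bfloat16,torch.float32,False,False"
  done
} > router_gemm_bf16_to_fp32_untuned.csv
```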
Reproduction Command (Triggers Crash)
python gradlib/gradlib/gemm_tuner_parallel.py \
  -i aiter/configs/router_gemm_bf16_to_fp32_untuned.csv \
  -o aiter/configs/router_gemm_bf16_to_fp32_tuned.csv \
  --mp 8

Expected Behavior
All 107 GEMM shapes should be tuned successfully without crashes.
Actual Behavior
- Process crashes with GPU core dump after tuning 7%-79% of shapes
- Different shapes fail on different runs (non-deterministic)
- Error: Memory access fault by GPU node-X on address 0x7f...
- Core dump files generated (~1.2 GB each)
Workaround (Single-GPU Works)
python gradlib/gradlib/gemm_tuner_parallel.py \
  -i aiter/configs/router_gemm_bf16_to_fp32_untuned.csv \
  -o aiter/configs/router_gemm_bf16_to_fp32_tuned.csv \
  --mp 1

This completes successfully without any crashes, confirming the issue is specific to multi-GPU parallel execution.
Sample Crash Log
[aiter] processed 22 batches of 52, Processing Status ====> 42.0% tuned
M N K bias dtype outdtype 1376 128 4096 False torch.bfloat16 torch.float32 False >>> Total hipb solutions 1706
[aiter] import [module_hipbsolgemm] under .../aiter/jit/module_hipbsolgemm.so (8 workers)
[W126 03:03:30.478958788 collection.cpp:1116] Warning: ROCTracer produced duplicate flow start: 4
Memory access fault by GPU node-2 (Agent handle: 0x55c77d07fd80) on address 0x7f0800018000. Reason: Unknown.
GPU core dump created: gpucore.554058
Reproducibility
- Multi-GPU (--mp 8): Crashes 100% of the time (6/6 runs tested)
- Single-GPU (--mp 1): Works 100% of the time (no crashes)
- Crash occurs at different shapes on each run (non-deterministic)
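Because the failing shape changes between runs, crash statistics are easiest to collect by rerunning the same command in a loop and keeping one log per run. A minimal sketch (the `run_N.log` naming is illustrative, not from the tooling):

```bash
# Rerun the crashing configuration several times, keeping stdout/stderr
# of each run; the failing shape and GPU node can then be compared per run.
for i in $(seq 1 6); do
  python gradlib/gradlib/gemm_tuner_parallel.py \
    -i aiter/configs/router_gemm_bf16_to_fp32_untuned.csv \
    -o aiter/configs/router_gemm_bf16_to_fp32_tuned.csv \
    --mp 8 2>&1 | tee "run_${i}.log"
done
```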
router_gemm_bf16_to_fp32_untuned.csv
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
ROCk module version 6.7.0 is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD EPYC 9754 128-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 9754 128-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3100
BDFID: 0
Internal Node ID: 0
Compute Unit: 256
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Coherent Host Access: FALSE
Features: None
Fast F16 Operation: FALSE
Wavefront Size: 0(0x0)
Workgroup Max Size: 0(0x0)
Workgroup Max Size per Dimension:
x 0(0x0)
y 0(0x0)
z 0(0x0)
Max Waves Per CU: 0(0x0)
Max Work-item Per CU: 0(0x0)
Grid Max Size: 0(0x0)
Grid Max Size per Dimension:
x 0(0x0)
y 0(0x0)
z 0(0x0)
Max fbarriers/Workgrp: 0
Packet Processor uCode:: 0
SDMA engine uCode:: 0
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED HOST ALLOC
Size: 1056561152(0x3ef80000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED HOST ALLOC
Size: 1056561152(0x3ef80000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx942
Uuid: GPU-a2d5d6b7e1c7e5e8
Marketing Name: AMD Instinct MI308X
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 32768(0x8000) KB
L3: 262144(0x40000) KB
Chip ID: 29858(0x74a2)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2100
BDFID: 49152
Internal Node ID: 2
Compute Unit: 304
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 186
SDMA engine uCode:: 24
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 4
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
FBarrier Max Size: 32
*******
Agent 3-9 (GPU 1-7)
*******
(Same configuration as Agent 2, all MI308X gfx942)
*** Done ***
Key Information:
- 8x AMD Instinct MI308X (gfx942)
- 304 Compute Units per GPU
- 192 GB memory per GPU (COARSE/FINE/EXTENDED FINE GRAINED pools)
- ROCm Runtime 1.14, Ext 1.6
- ISA: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
Additional Information
Core Dump Files
6 core dump files generated (~1.2 GB each):
gpucore.213735, gpucore.378428, gpucore.554058, gpucore.560948, gpucore.575199, gpucore.625995
Available upon request (too large for direct upload). Can provide via cloud storage or extract specific information if needed.
System Configuration Issues
NUMA Balancing: Currently ENABLED (value: 1)

$ cat /proc/sys/kernel/numa_balancing
1

To disable (not yet tested, requires root):

sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'

However, even with NUMA balancing enabled, single-GPU mode works fine, suggesting the issue is specifically in hipBLASLt's multi-GPU handling.
ROCTracer Warnings
All crashes are preceded by multiple ROCTracer warnings:
[W126 03:03:30.478958788 collection.cpp:1116] Warning: ROCTracer produced duplicate flow start: 4 (function operator())
These warnings appear consistently across all 8 GPU workers before the memory access fault.
Detailed Crash Logs
Full logs for all 6 crash occurrences available in attached GPU_CORE_DUMP_ISSUE_REPORT.md.
Sample pattern:
(M=1344 completed successfully)
[aiter] processed 22 batches of 52, Processing Status ====> 42.0% tuned
M N K bias dtype outdtype 1376 128 4096 False torch.bfloat16 torch.float32 False >>> Total hipb solutions 1706
[aiter] import [module_hipbsolgemm] (8 workers spawned)
[W126 03:03:30.478958788 collection.cpp:1116] Warning: ROCTracer produced duplicate flow start: 4
[W126 03:03:30.787522382 collection.cpp:1116] Warning: ROCTracer produced duplicate flow start: 4
Memory access fault by GPU node-2 (Agent handle: 0x55c77d07fd80) on address 0x7f0800018000. Reason: Unknown.
GPU core dump created: gpucore.554058
Input File
router_gemm_bf16_to_fp32_untuned.csv - 107 GEMM shapes for router operations:
- M: 1-32 (consecutive), then 40, 48, 56, 64, 96, 128, ..., up to 10240
- N: 128 (constant)
- K: 4096 (constant)
- dtype: bfloat16 → float32
- All shapes: no bias, no scaleAB, no preshuffle
Can be provided as attachment or GitHub Gist.
Workarounds Tested
- ✅ Single-GPU mode (--mp 1): Works perfectly, 100% success rate
- ⚠️ Reduced parallelism (--mp 4, --mp 2): Not tested yet, but likely to reduce crash frequency (see the sketch below)
- ❌ Disabling NUMA balancing: Not tested yet (requires root access)
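For reference, the untested reduced-parallelism variant would be invoked the same way as the crashing command, only with a lower --mp value:

```bash
# Untested per the list above: same tuning command with 4 workers
# instead of 8, to check whether crash frequency drops.
python gradlib/gradlib/gemm_tuner_parallel.py \
  -i aiter/configs/router_gemm_bf16_to_fp32_untuned.csv \
  -o aiter/configs/router_gemm_bf16_to_fp32_tuned.csv \
  --mp 4
```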
Questions for ROCm Team
- Are there known concurrency issues in hipBLASLt 1.1.0 with multi-GPU solution testing?
- Should explicit synchronization barriers be added between parallel workers?
- How can the GPU core dump files be analyzed to identify the specific hipBLASLt kernel causing the fault? (A tentative rocgdb sketch follows this list.)
- Is there a recommended maximum parallelism level for hipBLASLt tuning on MI308X?
- Could the 0x7f... memory addresses (host memory region) indicate a host-device synchronization issue?
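On question 3, one possible starting point is sketched below. It assumes rocgdb can open the generated gpucore files the way gdb opens host core files; that assumption has not been verified on this setup.

```bash
# Hedged sketch for question 3: load a GPU core dump in rocgdb (assumes
# rocgdb accepts gpucore files as core-file arguments; unverified here),
# then list threads and print a backtrace with standard gdb commands.
rocgdb "$(which python)" gpucore.554058 \
  -ex 'info threads' \
  -ex 'bt'
```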
Repository Information
- Code: aiter_extensions (commit: d8d8b9023cd3d081120e820d1c62718cad0d15f3)
- Tuning script: gradlib/gradlib/gemm_tuner_parallel.py
- GEMM tuner: gradlib/gradlib/GemmTuner.py
- Multi-process tuner: aiter/utility/mp_tuner.py
Environment Variables
HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

All 8 GPUs are visible and accessible. Single-GPU tests confirm each GPU works individually.
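A sketch of the kind of per-device smoke test that can confirm this (the script is illustrative, not the exact commands used; the shape is borrowed from one of the failing cases):

```bash
# Illustrative per-GPU smoke test: run one bf16 GEMM per device in
# isolation. PyTorch's ROCm build exposes GPUs through the 'cuda' device.
for gpu in $(seq 0 7); do
  HIP_VISIBLE_DEVICES=$gpu python -c "
import torch
a = torch.randn(1376, 4096, dtype=torch.bfloat16, device='cuda')
b = torch.randn(4096, 128, dtype=torch.bfloat16, device='cuda')
c = (a @ b).float()
torch.cuda.synchronize()
print('OK', c.shape)
" && echo "GPU $gpu OK" || echo "GPU $gpu FAILED"
done
```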