[Issue]: GLM-5 aiter fused_moe with SGLang + MI355 #2059

@ozziemoreno


Problem Description

The SGLang server handles 16 concurrent requests from the client side without issue, but crashes when concurrency is raised to 17 or more. The crash appears to happen inside the aiter fused_moe kernel. Is there a known-working container I should use?

Operating System

Ubuntu 22.04.5 LTS (Jammy Jellyfish)

CPU

AMD EPYC 9575F 64-Core Processor

GPU

8x AMD Instinct MI355X

ROCm Version

ROCm 7.2

ROCm Component

No response

Steps to Reproduce

SGLANG_ROCM_FUSED_DECODE_MLA=0 python3 -m sglang.launch_server --model zai-org/GLM-5 --tp-size 8 --attention-backend triton --disable-radix-cache --watchdog-timeout 1200

python3 -m sglang.bench_serving --dataset-name random --random-range-ratio 1 --max-concurrency 17 --num-prompt 32 --random-input 1000 --random-output 60 --model zai-org/GLM-5 --warmup-requests 0 --backend sglang-oai
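To bracket the failure threshold more precisely, the bench command above can be swept across a few concurrency values around the 16-pass / 17-fail boundary. A minimal sketch (the sweep values and the `bench_cmd` helper are illustrative, not part of the original report; it only prints the command lines, which can then be run one at a time against the already-launched server):

```python
import shlex

def bench_cmd(max_concurrency: int) -> list[str]:
    """Build the sglang.bench_serving command line for one concurrency level."""
    return [
        "python3", "-m", "sglang.bench_serving",
        "--dataset-name", "random",
        "--random-range-ratio", "1",
        "--max-concurrency", str(max_concurrency),
        "--num-prompt", "32",
        "--random-input", "1000",
        "--random-output", "60",
        "--model", "zai-org/GLM-5",
        "--warmup-requests", "0",
        "--backend", "sglang-oai",
    ]

# Sweep around the reported 16-pass / 17-fail boundary.
for conc in (15, 16, 17, 18):
    print(shlex.join(bench_cmd(conc)))
    # To execute each point directly: subprocess.run(bench_cmd(conc))
```

Running the printed commands in order (restarting the server between crashes) confirms whether the failure is a hard threshold at 17 or load-dependent.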

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

dmesg

[Thu Feb 12 20:35:53 2026] amdgpu 0000:65:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32826)
[Thu Feb 12 20:35:53 2026] amdgpu 0000:65:00.0: amdgpu:  Process python3 pid 3247476 thread python3 pid 3247476
[Thu Feb 12 20:35:53 2026] amdgpu 0000:65:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[Thu Feb 12 20:35:53 2026] amdgpu 0000:65:00.0: amdgpu:   cookie node_id 1 fault from die AID0.XCD0
[Thu Feb 12 20:35:53 2026] amdgpu 0000:65:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
[Thu Feb 12 20:35:53 2026] amdgpu 0000:65:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
[Thu Feb 12 20:35:53 2026] amdgpu 0000:65:00.0: amdgpu:          MORE_FAULTS: 0x1
[Thu Feb 12 20:35:53 2026] amdgpu 0000:65:00.0: amdgpu:          WALKER_ERROR: 0x0
[Thu Feb 12 20:35:53 2026] amdgpu 0000:65:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
[Thu Feb 12 20:35:53 2026] amdgpu 0000:65:00.0: amdgpu:          MAPPING_ERROR: 0x0
[Thu Feb 12 20:35:53 2026] amdgpu 0000:65:00.0: amdgpu:          RW: 0x0
[Thu Feb 12 20:35:53 2026] amdgpu 0000:85:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32797)
[Thu Feb 12 20:35:53 2026] amdgpu 0000:85:00.0: amdgpu:  Process python3 pid 3247479 thread python3 pid 3247479
[Thu Feb 12 20:35:53 2026] amdgpu 0000:85:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[Thu Feb 12 20:35:53 2026] amdgpu 0000:85:00.0: amdgpu:   cookie node_id 2 fault from die AID0.XCD1
[Thu Feb 12 20:35:53 2026] amdgpu 0000:85:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
[Thu Feb 12 20:35:53 2026] amdgpu 0000:85:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
[Thu Feb 12 20:35:53 2026] amdgpu 0000:85:00.0: amdgpu:          MORE_FAULTS: 0x1
[Thu Feb 12 20:35:53 2026] amdgpu 0000:85:00.0: amdgpu:          WALKER_ERROR: 0x0
[Thu Feb 12 20:35:53 2026] amdgpu 0000:85:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
[Thu Feb 12 20:35:53 2026] amdgpu 0000:85:00.0: amdgpu:          MAPPING_ERROR: 0x0
[Thu Feb 12 20:35:53 2026] amdgpu 0000:85:00.0: amdgpu:          RW: 0x0
[Thu Feb 12 20:35:53 2026] amdgpu 0000:75:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32800)
[Thu Feb 12 20:35:53 2026] amdgpu 0000:75:00.0: amdgpu:  Process python3 pid 3247474 thread python3 pid 3247474
[Thu Feb 12 20:35:53 2026] amdgpu 0000:75:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[Thu Feb 12 20:35:53 2026] amdgpu 0000:75:00.0: amdgpu:   cookie node_id 2 fault from die AID0.XCD1
[Thu Feb 12 20:35:53 2026] amdgpu 0000:75:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
[Thu Feb 12 20:35:53 2026] amdgpu 0000:75:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
[Thu Feb 12 20:35:53 2026] amdgpu 0000:75:00.0: amdgpu:          MORE_FAULTS: 0x1
[Thu Feb 12 20:35:53 2026] amdgpu 0000:75:00.0: amdgpu:          WALKER_ERROR: 0x0
[Thu Feb 12 20:35:53 2026] amdgpu 0000:75:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
[Thu Feb 12 20:35:53 2026] amdgpu 0000:75:00.0: amdgpu:          MAPPING_ERROR: 0x0
[Thu Feb 12 20:35:53 2026] amdgpu 0000:75:00.0: amdgpu:          RW: 0x0
[Thu Feb 12 20:35:53 2026] amdgpu 0000:e5:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32844)
[Thu Feb 12 20:35:53 2026] amdgpu 0000:e5:00.0: amdgpu:  Process python3 pid 3247480 thread python3 pid 3247480
[Thu Feb 12 20:35:53 2026] amdgpu 0000:e5:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[Thu Feb 12 20:35:53 2026] amdgpu 0000:e5:00.0: amdgpu:   cookie node_id 2 fault from die AID0.XCD1
[Thu Feb 12 20:35:53 2026] amdgpu 0000:e5:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
[Thu Feb 12 20:35:53 2026] amdgpu 0000:e5:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
[Thu Feb 12 20:35:53 2026] amdgpu 0000:e5:00.0: amdgpu:          MORE_FAULTS: 0x1
[Thu Feb 12 20:35:53 2026] amdgpu 0000:e5:00.0: amdgpu:          WALKER_ERROR: 0x0
[Thu Feb 12 20:35:53 2026] amdgpu 0000:e5:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
[Thu Feb 12 20:35:53 2026] amdgpu 0000:e5:00.0: amdgpu:          MAPPING_ERROR: 0x0
[Thu Feb 12 20:35:53 2026] amdgpu 0000:e5:00.0: amdgpu:          RW: 0x0
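For reference, the `VM_L2_PROTECTION_FAULT_STATUS` value in the dmesg output can be decoded by hand. The bit layout below is an assumption inferred from the fields the amdgpu driver prints (MORE_FAULTS, WALKER_ERROR, PERMISSION_FAULTS, MAPPING_ERROR, client ID, RW); it reproduces the driver's own decode of `0x00301031` above, but the kernel sources for the running ROCm/kernel version remain the authority:

```python
def decode_fault_status(status: int) -> dict:
    """Decode VM_L2_PROTECTION_FAULT_STATUS fields.

    Bit positions are an assumption matching the fields the amdgpu
    driver prints for this fault; verify against the kernel sources
    for your driver version before relying on them.
    """
    return {
        "MORE_FAULTS": status & 0x1,               # bit 0
        "WALKER_ERROR": (status >> 1) & 0x7,       # bits 1-3
        "PERMISSION_FAULTS": (status >> 4) & 0xF,  # bits 4-7
        "MAPPING_ERROR": (status >> 8) & 0x1,      # bit 8
        "CID": (status >> 9) & 0x1F,               # bits 9-13 (0x8 = TCP)
        "RW": (status >> 18) & 0x1,                # bit 18 (0 = read)
    }

print(decode_fault_status(0x00301031))
```

For the faults logged here this yields MORE_FAULTS=1, PERMISSION_FAULTS=0x3, CID=0x8 (TCP) and RW=0: a read to address 0x0, i.e. a null-pointer dereference on the GPU, consistent with the `Memory access fault ... on address (nil)` messages in the server log below.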

sglang server crash output

[2026-02-17 23:51:55] INFO:     127.0.0.1:36718 - "GET /get_server_info HTTP/1.1" 200 OK
[2026-02-17 23:52:11] INFO:     127.0.0.1:50626 - "POST /v1/completions HTTP/1.1" 200 OK
[2026-02-17 23:52:11] INFO:     127.0.0.1:50634 - "POST /v1/completions HTTP/1.1" 200 OK
[2026-02-17 23:52:11] INFO:     127.0.0.1:50636 - "POST /v1/completions HTTP/1.1" 200 OK
[2026-02-17 23:52:11] INFO:     127.0.0.1:50646 - "POST /v1/completions HTTP/1.1" 200 OK
[2026-02-17 23:52:11] INFO:     127.0.0.1:50662 - "POST /v1/completions HTTP/1.1" 200 OK
[2026-02-17 23:52:11] INFO:     127.0.0.1:50664 - "POST /v1/completions HTTP/1.1" 200 OK
[2026-02-17 23:52:11] INFO:     127.0.0.1:50678 - "POST /v1/completions HTTP/1.1" 200 OK
[2026-02-17 23:52:11] INFO:     127.0.0.1:50692 - "POST /v1/completions HTTP/1.1" 200 OK
[2026-02-17 23:52:11] INFO:     127.0.0.1:50702 - "POST /v1/completions HTTP/1.1" 200 OK
[2026-02-17 23:52:11] INFO:     127.0.0.1:50716 - "POST /v1/completions HTTP/1.1" 200 OK
[2026-02-17 23:52:11] INFO:     127.0.0.1:50732 - "POST /v1/completions HTTP/1.1" 200 OK
[2026-02-17 23:52:11] INFO:     127.0.0.1:50740 - "POST /v1/completions HTTP/1.1" 200 OK
[2026-02-17 23:52:11] INFO:     127.0.0.1:50746 - "POST /v1/completions HTTP/1.1" 200 OK
[2026-02-17 23:52:11] INFO:     127.0.0.1:50748 - "POST /v1/completions HTTP/1.1" 200 OK
[2026-02-17 23:52:11] INFO:     127.0.0.1:50762 - "POST /v1/completions HTTP/1.1" 200 OK
[2026-02-17 23:52:11] INFO:     127.0.0.1:50768 - "POST /v1/completions HTTP/1.1" 200 OK
[2026-02-17 23:52:11] INFO:     127.0.0.1:50778 - "POST /v1/completions HTTP/1.1" 200 OK
[aiter] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = False, estimated_m_per_expert = 573
[2026-02-17 23:52:11 TP4] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = False, estimated_m_per_expert = 573
[aiter] [fused_moe] using 2stage default for (256, 16384, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[2026-02-17 23:52:11 TP4] [fused_moe] using 2stage default for (256, 16384, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[aiter] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = False, estimated_m_per_expert = 573
[2026-02-17 23:52:11 TP0] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = False, estimated_m_per_expert = 573
[aiter] [fused_moe] using 2stage default for (256, 16384, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[2026-02-17 23:52:11 TP0] [fused_moe] using 2stage default for (256, 16384, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[aiter] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = False, estimated_m_per_expert = 573
[2026-02-17 23:52:11 TP1] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = False, estimated_m_per_expert = 573
[aiter] [fused_moe] using 2stage default for (256, 16384, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[2026-02-17 23:52:11 TP1] [fused_moe] using 2stage default for (256, 16384, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[aiter] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = False, estimated_m_per_expert = 573
[2026-02-17 23:52:11 TP3] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = False, estimated_m_per_expert = 573
[aiter] [fused_moe] using 2stage default for (256, 16384, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[2026-02-17 23:52:11 TP3] [fused_moe] using 2stage default for (256, 16384, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[aiter] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = False, estimated_m_per_expert = 573
[2026-02-17 23:52:11 TP5] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = False, estimated_m_per_expert = 573
[aiter] [fused_moe] using 2stage default for (256, 16384, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[2026-02-17 23:52:11 TP5] [fused_moe] using 2stage default for (256, 16384, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[aiter] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = False, estimated_m_per_expert = 573
[2026-02-17 23:52:11 TP7] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = False, estimated_m_per_expert = 573
[aiter] [fused_moe] using 2stage default for (256, 16384, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[2026-02-17 23:52:11 TP7] [fused_moe] using 2stage default for (256, 16384, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[aiter] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = False, estimated_m_per_expert = 573
[2026-02-17 23:52:11 TP2] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = False, estimated_m_per_expert = 573
[aiter] [fused_moe] using 2stage default for (256, 16384, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[2026-02-17 23:52:11 TP2] [fused_moe] using 2stage default for (256, 16384, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[aiter] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = False, estimated_m_per_expert = 573
[2026-02-17 23:52:11 TP6] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = False, estimated_m_per_expert = 573
[aiter] [fused_moe] using 2stage default for (256, 16384, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[2026-02-17 23:52:11 TP6] [fused_moe] using 2stage default for (256, 16384, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
/app/py_3.12/lib/python3.12/site-packages/triton/backends/amd/compiler.py:79: UserWarning: kpack is deprecated starting from gfx950 and will be removed in later releases. So for now kpack = 2 will be overwritten to 1 to make transitioning easier.
  warnings.warn(
/app/py_3.12/lib/python3.12/site-packages/triton/backends/amd/compiler.py:79: UserWarning: kpack is deprecated starting from gfx950 and will be removed in later releases. So for now kpack = 2 will be overwritten to 1 to make transitioning easier.
  warnings.warn(
/app/py_3.12/lib/python3.12/site-packages/triton/backends/amd/compiler.py:79: UserWarning: kpack is deprecated starting from gfx950 and will be removed in later releases. So for now kpack = 2 will be overwritten to 1 to make transitioning easier.
  warnings.warn(
/app/py_3.12/lib/python3.12/site-packages/triton/backends/amd/compiler.py:79: UserWarning: kpack is deprecated starting from gfx950 and will be removed in later releases. So for now kpack = 2 will be overwritten to 1 to make transitioning easier.
  warnings.warn(
/app/py_3.12/lib/python3.12/site-packages/triton/backends/amd/compiler.py:79: UserWarning: kpack is deprecated starting from gfx950 and will be removed in later releases. So for now kpack = 2 will be overwritten to 1 to make transitioning easier.
  warnings.warn(
/app/py_3.12/lib/python3.12/site-packages/triton/backends/amd/compiler.py:79: UserWarning: kpack is deprecated starting from gfx950 and will be removed in later releases. So for now kpack = 2 will be overwritten to 1 to make transitioning easier.
  warnings.warn(
/app/py_3.12/lib/python3.12/site-packages/triton/backends/amd/compiler.py:79: UserWarning: kpack is deprecated starting from gfx950 and will be removed in later releases. So for now kpack = 2 will be overwritten to 1 to make transitioning easier.
  warnings.warn(
/app/py_3.12/lib/python3.12/site-packages/triton/backends/amd/compiler.py:79: UserWarning: kpack is deprecated starting from gfx950 and will be removed in later releases. So for now kpack = 2 will be overwritten to 1 to make transitioning easier.
  warnings.warn(
[aiter] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = True, estimated_m_per_expert = 35
[2026-02-17 23:52:11 TP5] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = True, estimated_m_per_expert = 35
[aiter] [fused_moe] using 2stage default for (256, 1024, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[2026-02-17 23:52:11 TP5] [fused_moe] using 2stage default for (256, 1024, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[aiter] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = True, estimated_m_per_expert = 35
[2026-02-17 23:52:11 TP0] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = True, estimated_m_per_expert = 35
[aiter] [fused_moe] using 2stage default for (256, 1024, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[2026-02-17 23:52:11 TP0] [fused_moe] using 2stage default for (256, 1024, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[aiter] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = True, estimated_m_per_expert = 35
[2026-02-17 23:52:12 TP2] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = True, estimated_m_per_expert = 35
[aiter] [fused_moe] using 2stage default for (256, 1024, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[2026-02-17 23:52:12 TP2] [fused_moe] using 2stage default for (256, 1024, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[aiter] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = True, estimated_m_per_expert = 35
[2026-02-17 23:52:12 TP3] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = True, estimated_m_per_expert = 35
[aiter] [fused_moe] using 2stage default for (256, 1024, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[2026-02-17 23:52:12 TP3] [fused_moe] using 2stage default for (256, 1024, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[aiter] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = True, estimated_m_per_expert = 35
[2026-02-17 23:52:12 TP1] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = True, estimated_m_per_expert = 35
[aiter] [fused_moe] using 2stage default for (256, 1024, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[2026-02-17 23:52:12 TP1] [fused_moe] using 2stage default for (256, 1024, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[aiter] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = True, estimated_m_per_expert = 35
[aiter] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = True, estimated_m_per_expert = 35
[2026-02-17 23:52:12 TP7] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = True, estimated_m_per_expert = 35
[2026-02-17 23:52:12 TP6] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = True, estimated_m_per_expert = 35
[aiter] [fused_moe] using 2stage default for (256, 1024, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[2026-02-17 23:52:12 TP7] [fused_moe] using 2stage default for (256, 1024, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[aiter] [fused_moe] using 2stage default for (256, 1024, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[2026-02-17 23:52:12 TP6] [fused_moe] using 2stage default for (256, 1024, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
Memory access fault by GPU node-7 (Agent handle: 0x20916760) on address (nil). Reason: Unknown.
Memory access fault by GPU node-4 (Agent handle: 0x1992e500) on address (nil). Reason: Unknown.
Memory access fault by GPU node-2 (Agent handle: 0x27c21790) on address (nil). Reason: Unknown.
[aiter] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = True, estimated_m_per_expert = 35
[2026-02-17 23:52:12 TP4] run_1stage = False, ksplit = 0 q_type = QuantType.No block_m = 128 use_nt = True, estimated_m_per_expert = 35
[aiter] [fused_moe] using 2stage default for (256, 1024, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
[2026-02-17 23:52:12 TP4] [fused_moe] using 2stage default for (256, 1024, 6144, 256, 257, 9, 'ActivationType.Silu', 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'QuantType.No', True, False) 
Memory access fault by GPU node-9 (Agent handle: 0x21c27f20) on address (nil). Reason: Unknown.
Memory access fault by GPU node-3 (Agent handle: 0x410a8c90) on address (nil). Reason: Unknown.
Memory access fault by GPU node-5 (Agent handle: 0x16c8a2b0) on address (nil). Reason: Unknown.
Memory access fault by GPU node-6 (Agent handle: 0x4b79ff10) on address (nil). Reason: Unknown.
Memory access fault by GPU node-8 (Agent handle: 0x169cc520) on address (nil). Reason: Unknown.
GPU coredump: Directory "/coredumps not writable or does not exist
GPU core dump failed
Fatal Python error: Aborted

Thread 0x000070b8fd5ff640 (most recent call first):
  File "/app/sglang-repo/python/sglang/srt/utils/watchdog.py", line 145 in _watchdog_once
  File "/app/sglang-repo/python/sglang/srt/utils/watchdog.py", line 125 in _watchdog_thread
  File "/root/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/threading.py", line 1012 in run
  File "/root/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/root/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/threading.py", line 1032 in _bootstrap

(The remaining faulthandler output is interleaved across the eight TP ranks aborting simultaneously; the de-interleaved main-thread traceback of a crashing rank follows. Watchdog, tqdm monitor, and threading bootstrap frames from the other ranks are omitted.)

Thread 0x0000714544532740 (most recent call first):
  File "/app/aiter-repo/aiter/fused_moe.py", line 117 in fused_moe
  File "/app/sglang-repo/python/sglang/srt/layers/quantization/unquant.py", line 407 in forward_cuda
  File "/app/sglang-repo/python/sglang/srt/layers/utils/multi_platform.py", line 83 in forward_hip
  File "/app/sglang-repo/python/sglang/srt/layers/utils/multi_platform.py", line 71 in forward
  File "/app/sglang-repo/python/sglang/srt/layers/quantization/unquant.py", line 342 in apply
  File "/app/sglang-repo/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 1017 in run_moe_core
  File "/app/sglang-repo/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 996 in forward_impl
  File "/app/sglang-repo/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 977 in forward
  File "/app/py_3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787 in _call_impl
  File "/app/py_3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776 in _wrapped_call_impl
  File "/app/sglang-repo/python/sglang/srt/models/deepseek_v2.py", line 680 in forward_normal
  File "/app/sglang-repo/python/sglang/srt/models/deepseek_v2.py", line 582 in forward
  File "/app/py_3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787 in _call_impl
  File "/app/py_3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776 in _wrapped_call_impl
  File "/app/sglang-repo/python/sglang/srt/models/deepseek_v2.py", line 2421 in forward
  File "/app/py_3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787 in _call_impl
  File "/app/py_3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776 in _wrapped_call_impl
  File "/app/sglang-repo/python/sglang/srt/models/deepseek_v2.py", line 2730 in forward
  File "/app/py_3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787 in _call_impl
  File "/app/py_3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776 in _wrapped_call_impl
  File "/app/sglang-repo/python/sglang/srt/models/deepseek_v2.py", line 2919 in forward
  File "/app/py_3.12/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124 in decorate_context
  File "/app/sglang-repo/python/sglang/srt/model_executor/model_runner.py", line 2327 in forward_extend
  File "/app/sglang-repo/python/sglang/srt/model_executor/model_runner.py", line 2489 in _forward_raw
  File "/app/sglang-repo/python/sglang/srt/model_executor/model_runner.py", line 2390 in forward
  File "/app/sglang-repo/python/sglang/srt/managers/tp_worker.py", line 456 in forward_batch_generation
  File "/app/sglang-repo/python/sglang/srt/managers/scheduler.py", line 2341 in run_batch
  File "/app/sglang-repo/python/sglang/srt/managers/scheduler.py", line 1153 in event_loop_overlap
  File "/app/py_3.12/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124 in decorate_context
  File "/app/sglang-repo/python/sglang/srt/managers/scheduler.py", line 3160 in run_scheduler_process
  File "/root/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108 in run
  File "/root/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314 in _bootstrap
  File "/root/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/spawn.py", line 135 in _main
  File "/root/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/spawn.py", line 122 in spawn_main
  File "<string>", line 1 in <module>

Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, pybase64._pybase64, charset_normalizer.md, …
, line   File "requests.packages.charset_normalizer.md"60 in 1075", line , /app/py_3.12/lib/python3.12/site-packages/tqdm/_monitor.pyrun in /app/sglang-repo/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe_triton_kernels.py60requests.packages.chardet.md"
, line _bootstrap_inner" in   File 60
run" in   File 
/root/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/threading.pyrun"
, line   File 1075" in ",   File /root/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/threading.py_bootstrap_inner/root/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/threading.py"multidict._multidict""
, line   File , line /root/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/threading.py, 1075"1032 in "yarl._quoting_c in /root/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/threading.py_bootstrap, line , propcache._helpers_c_bootstrap_inner"
1075, aiohttp._http_writer
, line 
 in ,   File 1032Thread 0x_bootstrap_inneraiohttp._http_parser" in 0000768e028a6740
, /root/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/threading.py_bootstrap" (most recent call first):
  File aiohttp._websocket.mask
, line   File ", 
1032"/app/sglang-repo/python/sglang/srt/models/deepseek_v2.py/root/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/threading.pyaiohttp._websocket.reader_cThread 0x in _bootstrap""00007ef782fff640, 
, line  (most recent call first):
frozenlist._frozenlist
1032  File Thread 0x,  in "0000791b26ef2740torch._C_bootstrap/root/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/threading.py (most recent call first):
, 
"  File torch._C._dynamo.autograd_compiler
, line ", Thread 0x359/app/py_3.12/lib/python3.12/site-packages/torch/nn/modules/module.pytorch._C._dynamo.eval_frame000077fc634b3740 in ",  (most recent call first):
wait, line torch._C._dynamo.guards  File 
1787, "  File  in torch._C._dynamo.utils/app/aiter-repo/aiter/rotary_embedding.py"_call_impl, torch._C._fft"/root/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/threading.py
, , line "  File torch._C._linalg180, line ", torch._C._nested in 655/app/py_3.12/lib/python3.12/site-packages/torch/nn/modules/module.py, forward_native in "torch._C._nn
wait, line ,   File 
328  File  in "_wrapped_call_impl/app/py_3.12/lib/python3.12/site-packages/tqdm/_monitor.py
"torch._C._sparse", line   File , /app/py_3.12/lib/python3.12/site-packages/torch/nn/modules/module.py60torch._C._special" in , line run0
 in   File _call_impl"
/root/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/threading.py  File ", line 1075 in _bootstrap_inner
  File "/root/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/threading.py", line 1032 in _bootstrap
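The dump above is hard to read because all tensor-parallel workers wrote their faulthandler output to the same stderr at once. A minimal sketch for getting a clean per-process dump on the next crash (the log path is a hypothetical choice, not part of SGLang; it uses only the stdlib `faulthandler` module):

```python
import faulthandler
import os
import tempfile

# Route each process's faulthandler dump into its own per-PID file so the
# stacks from the TP worker processes do not interleave on shared stderr.
log_path = os.path.join(tempfile.gettempdir(), f"fault_{os.getpid()}.log")
log_file = open(log_path, "w")
faulthandler.enable(file=log_file, all_threads=True)

# A fatal signal (e.g. the SIGABRT after the HIP page fault) triggers the
# dump automatically; here we force one just to confirm the file is written.
faulthandler.dump_traceback(file=log_file)
log_file.flush()
```

Placing this early in the worker entry point (before model load) would make it possible to tell which rank faults first.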

pip packages

amd-aiter                         0.1.10.post4.dev17+g7411c9975       /app/aiter-repo
conch-triton-kernels              1.2.1
flash_attn                        2.8.3
sgl-kernel                        0.3.21
sglang                            0.5.6.post3.dev2061+g10569d04b      /app/sglang-repo/python
sglang-router                     0.3.2
torch                             2.10.0a0+git449b176
torchao                           0.9.0
torchaudio                        2.10.0+27b7ebd
torchvision                       0.25.0+8ac84ee
transformers                      5.2.0.dev0
triton                            3.5.0+gitc3c476f3
triton_kernels                    1.0.0
