**IMbackK** (Collaborator) commented Dec 19, 2025

On MFMA hardware, MMQ performs better for medium-sized problems, while dequant+rocBLAS performs better for large problem sizes.

Currently, `ggml_cuda_should_use_mmq` chooses based on batch size and data type. This is suboptimal for MUL_MAT_ID: even when the involved tensors are large, a high expert count means rocBLAS ends up being called on a large number of small tensors, causing poor performance.
This PR addresses this by choosing MMQ when the number of experts is high, roughly as sketched below.
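A minimal sketch of the heuristic described above, not the actual diff: the function name, signature, and both thresholds here are illustrative assumptions, and the real `ggml_cuda_should_use_mmq` also weighs data type and hardware capabilities, which are omitted for brevity.

```cpp
#include <cstdint>

// Hypothetical sketch of the MUL_MAT_ID heuristic -- not the actual patch.
// Assumption: the caller passes the expert count (1 for a plain MUL_MAT).
static bool should_use_mmq_sketch(int64_t batch_size, int64_t n_experts) {
    // Illustrative thresholds, not taken from the PR.
    const int64_t mmq_batch_limit   = 256;
    const int64_t high_expert_count = 16;

    // MUL_MAT_ID with many experts: the per-expert splits are small even
    // when the overall batch is large, so dequant+rocBLAS would run on many
    // tiny matrices. Prefer MMQ instead.
    if (n_experts >= high_expert_count) {
        return true;
    }

    // Otherwise keep the existing batch-size heuristic, which prefers
    // dequant+rocBLAS for large batches on MFMA hardware.
    return batch_size <= mmq_batch_limit;
}
```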

Benchmarks on an MI100 @ 160 W power limit:

| Model | Microbatch size | Test | t/s master | t/s mmidopt | Speedup |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 32 | pp1024 | 737.25 | 745.02 | 1.01 |
| gpt-oss 20B MXFP4 MoE | 64 | pp1024 | 962.68 | 974.75 | 1.01 |
| gpt-oss 20B MXFP4 MoE | 128 | pp1024 | 955.28 | 967.76 | 1.01 |
| gpt-oss 20B MXFP4 MoE | 256 | pp1024 | 1720.56 | 1725.10 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 512 | pp1024 | 2277.16 | 2291.13 | 1.01 |
| gpt-oss 20B MXFP4 MoE | 1024 | pp1024 | 2665.15 | 2685.24 | 1.01 |
| qwen3moe 30B.A3B Q4_K_M | 32 | pp1024 | 436.42 | 434.94 | 1.00 |
| qwen3moe 30B.A3B Q4_K_M | 64 | pp1024 | 562.45 | 563.55 | 1.00 |
| qwen3moe 30B.A3B Q4_K_M | 128 | pp1024 | 716.47 | 721.23 | 1.01 |
| qwen3moe 30B.A3B Q4_K_M | 256 | pp1024 | 1032.03 | 1124.19 | 1.09 |
| qwen3moe 30B.A3B Q4_K_M | 512 | pp1024 | 782.11 | 1497.25 | 1.91 |
| qwen3moe 30B.A3B Q4_K_M | 1024 | pp1024 | 1058.36 | 1738.98 | 1.64 |

Future note: it may be better to select based on the size of the resulting splits; see the sketch below.
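One hedged sketch of that idea: instead of gating on the raw expert count, estimate the expected number of rows each expert's split will receive and apply the batch-size threshold to that estimate. All names and thresholds here are hypothetical, not from the PR.

```cpp
#include <cstdint>

// Hypothetical sketch of the "future note" above: with top-k routing, each
// expert sees on average n_tokens * n_experts_used / n_experts rows, so the
// existing batch-size heuristic can be applied to that estimate directly.
static bool should_use_mmq_by_split_size(int64_t n_tokens, int64_t n_experts,
                                         int64_t n_experts_used) {
    const int64_t expected_split_rows = n_tokens * n_experts_used / n_experts;

    const int64_t mmq_batch_limit = 256; // illustrative threshold, not from the PR
    return expected_split_rows <= mmq_batch_limit;
}
```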

**github-actions** bot added the **Nvidia GPU** and **ggml** labels Dec 19, 2025