**IMbackK** (Collaborator) commented Dec 19, 2025

On MFMA hardware, MMQ performs better for medium-sized problems, while dequant+rocBLAS performs better for large problem sizes.

Currently, `ggml_cuda_should_use_mmq` chooses based on batch size and data type. This is suboptimal for MUL_MAT_ID: even when the involved tensors are large, a high expert count means rocBLAS ends up being called on a large number of small tensors, causing poor performance.
This PR addresses this by choosing MMQ when the number of experts is high, roughly as sketched below.
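A minimal sketch of the heuristic described above, not the actual diff: the function name, signature, and both thresholds here are illustrative assumptions, and the real `ggml_cuda_should_use_mmq` also weighs data type and hardware capabilities, which are omitted for brevity.

```cpp
#include <cstdint>

// Hypothetical sketch of the MUL_MAT_ID heuristic -- not the actual patch.
// Assumption: the caller passes the expert count (1 for a plain MUL_MAT).
static bool should_use_mmq_sketch(int64_t batch_size, int64_t n_experts) {
    // Illustrative thresholds, not taken from the PR.
    const int64_t mmq_batch_limit   = 256;
    const int64_t high_expert_count = 16;

    // MUL_MAT_ID with many experts: the per-expert splits are small even
    // when the overall batch is large, so dequant+rocBLAS would run on many
    // tiny matrices. Prefer MMQ instead.
    if (n_experts >= high_expert_count) {
        return true;
    }

    // Otherwise keep the existing batch-size heuristic, which prefers
    // dequant+rocBLAS for large batches on MFMA hardware.
    return batch_size <= mmq_batch_limit;
}
```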

Benchmarks on an MI100 @ 160 W power limit:

| Model | Microbatch size | Test | t/s master | t/s mmidopt | Speedup |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 32 | pp1024 | 737.25 | 745.02 | 1.01 |
| gpt-oss 20B MXFP4 MoE | 64 | pp1024 | 962.68 | 974.75 | 1.01 |
| gpt-oss 20B MXFP4 MoE | 128 | pp1024 | 955.28 | 967.76 | 1.01 |
| gpt-oss 20B MXFP4 MoE | 256 | pp1024 | 1720.56 | 1725.10 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 512 | pp1024 | 2277.16 | 2291.13 | 1.01 |
| gpt-oss 20B MXFP4 MoE | 1024 | pp1024 | 2665.15 | 2685.24 | 1.01 |
| qwen3moe 30B.A3B Q4_K_M | 32 | pp1024 | 436.42 | 434.94 | 1.00 |
| qwen3moe 30B.A3B Q4_K_M | 64 | pp1024 | 562.45 | 563.55 | 1.00 |
| qwen3moe 30B.A3B Q4_K_M | 128 | pp1024 | 716.47 | 721.23 | 1.01 |
| qwen3moe 30B.A3B Q4_K_M | 256 | pp1024 | 1032.03 | 1124.19 | 1.09 |
| qwen3moe 30B.A3B Q4_K_M | 512 | pp1024 | 782.11 | 1497.25 | 1.91 |
| qwen3moe 30B.A3B Q4_K_M | 1024 | pp1024 | 1058.36 | 1738.98 | 1.64 |

Future note: it may be better to select based on the size of the resulting splits; see the sketch below.
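One hedged sketch of that idea: instead of gating on the raw expert count, estimate the expected number of rows each expert's split will receive and apply the batch-size threshold to that estimate. All names and thresholds here are hypothetical, not from the PR.

```cpp
#include <cstdint>

// Hypothetical sketch of the "future note" above: with top-k routing, each
// expert sees on average n_tokens * n_experts_used / n_experts rows, so the
// existing batch-size heuristic can be applied to that estimate directly.
static bool should_use_mmq_by_split_size(int64_t n_tokens, int64_t n_experts,
                                         int64_t n_experts_used) {
    const int64_t expected_split_rows = n_tokens * n_experts_used / n_experts;

    const int64_t mmq_batch_limit = 256; // illustrative threshold, not from the PR
    return expected_split_rows <= mmq_batch_limit;
}
```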

**github-actions** bot added the **Nvidia GPU** and **ggml** labels Dec 19, 2025