
Conversation

horasal commented Oct 26, 2025


  • Add basic MXFP6 (E3M2 and E2M3) support to ggml
  • CPU-only implementation (no CUDA or AVX2 paths yet)
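
For context, MX formats put 32 elements in a block with one shared 8-bit power-of-two (E8M0) scale, so MXFP6 costs 6 bits per element plus the scale. Below is a minimal element-decoding sketch based on the OCP Microscaling spec; the names (`block_mxfp6`, the `mxfp6_*_to_float` helpers) are illustrative, not necessarily what this PR uses.

```c
#include <stdint.h>
#include <math.h>

#define QK_MXFP6 32

// One MX block: a shared E8M0 scale plus 32 six-bit elements packed into 24 bytes.
typedef struct {
    uint8_t e;                     // shared scale, value = 2^(e - 127)
    uint8_t qs[QK_MXFP6 * 6 / 8];  // 32 x 6-bit elements
} block_mxfp6;

// Decode one E2M3 element: 1 sign, 2 exponent, 3 mantissa bits, exponent bias 1.
static float mxfp6_e2m3_to_float(uint8_t v) {
    const int s = (v >> 5) & 1;
    const int e = (v >> 3) & 3;
    const int m =  v       & 7;
    // e == 0 is subnormal: m/8 * 2^(1-1); otherwise (1 + m/8) * 2^(e-1). Max = 7.5.
    const float f = e == 0 ? m / 8.0f : ldexpf(1.0f + m / 8.0f, e - 1);
    return s ? -f : f;
}

// Decode one E3M2 element: 1 sign, 3 exponent, 2 mantissa bits, exponent bias 3.
static float mxfp6_e3m2_to_float(uint8_t v) {
    const int s = (v >> 5) & 1;
    const int e = (v >> 2) & 7;
    const int m =  v       & 3;
    // e == 0 is subnormal: m/4 * 2^(1-3); otherwise (1 + m/4) * 2^(e-3). Max = 28.
    const float f = e == 0 ? ldexpf(m / 4.0f, -2) : ldexpf(1.0f + m / 4.0f, e - 3);
    return s ? -f : f;
}
```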

test-quantize-* passed in local CI.
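
(test-quantize-fns round-trips every type through quantize/dequantize and checks the error.) As a sketch of what row dequantization could look like for the layout above, building on the decoders sketched earlier; the 6-bit packing order here is my assumption, and the PR may pack differently:

```c
// Hypothetical: every 3 payload bytes hold four 6-bit values; the shared
// E8M0 scale becomes a single power-of-two multiply per block.
static void dequantize_row_mxfp6_e2m3(const block_mxfp6 * x, float * y, int64_t k) {
    const int64_t nb = k / QK_MXFP6;
    for (int64_t i = 0; i < nb; i++) {
        const float d = ldexpf(1.0f, (int) x[i].e - 127);  // 2^(e - 127)
        for (int j = 0; j < QK_MXFP6; j += 4) {
            const uint8_t * p = &x[i].qs[j / 4 * 3];
            const uint32_t bits = p[0] | (p[1] << 8) | ((uint32_t) p[2] << 16);
            for (int l = 0; l < 4; l++) {
                y[i*QK_MXFP6 + j + l] = d * mxfp6_e2m3_to_float((bits >> (6*l)) & 0x3F);
            }
        }
    }
}
```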

d:\llama.cpp> ./build_cpu/bin/Debug/test-quantize-fns.exe
....
Testing tq2_0
Testing mxfp4
Testing mxfp6_e3m2
Testing mxfp6_e2m3

d:\llama.cpp> ./build_cpu/bin/Debug/test-quantize-perf.exe
...
mxfp4
  quantize_row_q_reference
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      0.08 GB/s
      quantized throughput :      0.01 GB/s

  quantize_row_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      0.08 GB/s
      quantized throughput :      0.01 GB/s

  dequantize_row_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      2.18 GB/s
      quantized throughput :      0.29 GB/s

  quantize_row_q_dot
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      2.46 GB/s
      quantized throughput :      0.33 GB/s

  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      5.87 GB/s
      quantized throughput :      0.78 GB/s

mxfp6_e3m2
  quantize_row_q_reference
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      0.02 GB/s
      quantized throughput :      0.00 GB/s

  quantize_row_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      0.02 GB/s
      quantized throughput :      0.00 GB/s

  dequantize_row_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      2.03 GB/s
      quantized throughput :      0.40 GB/s

  quantize_row_q_dot
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      2.46 GB/s
      quantized throughput :      0.48 GB/s

  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      2.06 GB/s
      quantized throughput :      0.40 GB/s

mxfp6_e2m3
  quantize_row_q_reference
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      0.02 GB/s
      quantized throughput :      0.00 GB/s

  quantize_row_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      0.02 GB/s
      quantized throughput :      0.00 GB/s

  dequantize_row_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      2.03 GB/s
      quantized throughput :      0.40 GB/s

  quantize_row_q_dot
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      2.42 GB/s
      quantized throughput :      0.47 GB/s

  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      2.72 GB/s
      quantized throughput :      0.53 GB/s

      Start 29: test-gguf
29/35 Test #29: test-gguf .........................   Passed    0.15 sec
      Start 32: test-barrier
30/35 Test #32: test-barrier ......................   Passed    1.16 sec
      Start 33: test-quantize-fns
31/35 Test #33: test-quantize-fns .................   Passed   17.83 sec
      Start 34: test-quantize-perf
32/35 Test #34: test-quantize-perf ................   Passed    0.28 sec
      Start 35: test-rope
33/35 Test #35: test-rope .........................   Passed    0.06 sec
      Start 36: test-mtmd-c-api
34/35 Test #36: test-mtmd-c-api ...................   Passed    0.01 sec
      Start 37: test-alloc
35/35 Test #37: test-alloc ........................   Passed    0.00 sec

97% tests passed, 1 tests failed out of 35

Label Time Summary:
main    =  46.14 sec*proc (35 tests)

Total Test time (real) =  46.15 sec

The following tests FAILED:
	 14 - test-tokenizers-ggml-vocabs (Failed)
Errors while running CTest

real	0m46.164s
user	1m14.081s
sys	0m1.035s

test-tokenizers-ggml-vocabs reports a failure, but I don't think it's caused by this PR, as it doesn't touch the GGUF parser:

main : reading vocab from: 'models/ggml-vocabs/PLaMo2/ggml-vocab-plamo2.gguf'
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen 5 7600 6-Core Processor)
gguf_init_from_file_impl: invalid magic characters: 'vers', expected 'GGUF'
llama_model_load: error loading model: llama_model_loader: failed to load model from /models/ggml-vocabs/PLaMo2/ggml-vocab-plamo2.gguf

@github-actions github-actions bot added the Nvidia GPU, examples, python, and ggml labels Oct 26, 2025
@horasal horasal marked this pull request as draft October 26, 2025 08:33
CISC (Collaborator) commented Oct 26, 2025

test-tokenizers-ggml-vocabs reports a failure, but I don't think it's caused by this PR, as it doesn't touch the GGUF parser:

main : reading vocab from: 'models/ggml-vocabs/PLaMo2/ggml-vocab-plamo2.gguf'
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen 5 7600 6-Core Processor)
gguf_init_from_file_impl: invalid magic characters: 'vers', expected 'GGUF'
llama_model_load: error loading model: llama_model_loader: failed to load model from /models/ggml-vocabs/PLaMo2/ggml-vocab-plamo2.gguf

This happens because you have not installed/initialized git-lfs, so the vocab files were not downloaded properly.
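
(For reference, the 'vers' magic above is the start of a git-lfs pointer file, which begins with `version https://git-lfs.github.com/spec/v1`. The usual fix is the standard git-lfs dance:)

```
git lfs install
git lfs pull
```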

horasal (Author) commented Oct 27, 2025

installed/initialized git-lfs

Thanks for the advice.
I ran some tests quantizing an existing model to MXFP6, and it seems to work well now.
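
For anyone reproducing this, requantization goes through the stock llama-quantize tool; the exact MXFP6 type name this PR registers is an assumption on my part:

```
./build_cpu/bin/llama-quantize model-f16.gguf model-mxfp6.gguf MXFP6_E2M3
```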

@horasal horasal marked this pull request as ready for review October 27, 2025 00:31
CISC (Collaborator) commented Oct 27, 2025

Please restore formatting to constants.py and quants.py.

slaren (Member) commented Oct 28, 2025

What is the motivation for this type? Are there any models being natively distributed in MXFP6, or does it perform better than other quantizations?

jacekpoplawski (Contributor) commented

What is the motivation for this type? Are there any models being natively distributed in MXFP6, or does it perform better than other quantizations?

probably Blackwell support

horasal (Author) commented Oct 28, 2025

What is the motivation for this type? Are there any models being natively distributed in MXFP6, or does it perform better than other quantizations?

Currently, there are no models natively distributed in MXFP6, but I think MXFP6 may offer a good balance between model quality and performance in the future :)

NVIDIA's Blackwell architecture is expected to support MXFP6, and AMD's MI355X also includes MXFP6 support. Additionally, while MXFP4 has shown promising results with QAT, some papers (e.g., Tables 2 and 3 in this paper) report that MXFP4 may not perform as well under direct quantization, which is one of the common use cases for llama.cpp. In contrast, MXFP6 appears to be more robust in such settings.
