
Conversation

horasal commented Oct 26, 2025


  • Add basic MXFP6 (E3M2 and E2M3) support to ggml
  • CPU-only implementation (no CUDA or AVX2 paths yet)
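
For context, MX formats put 32 elements in a block with one shared 8-bit power-of-two (E8M0) scale, so MXFP6 costs 6 bits per element plus the scale. Below is a minimal element-decoding sketch based on the OCP Microscaling spec; the names (`block_mxfp6`, the `mxfp6_*_to_float` helpers) are illustrative, not necessarily what this PR uses.

```c
#include <stdint.h>
#include <math.h>

#define QK_MXFP6 32

// One MX block: a shared E8M0 scale plus 32 six-bit elements packed into 24 bytes.
typedef struct {
    uint8_t e;                     // shared scale, value = 2^(e - 127)
    uint8_t qs[QK_MXFP6 * 6 / 8];  // 32 x 6-bit elements
} block_mxfp6;

// Decode one E2M3 element: 1 sign, 2 exponent, 3 mantissa bits, exponent bias 1.
static float mxfp6_e2m3_to_float(uint8_t v) {
    const int s = (v >> 5) & 1;
    const int e = (v >> 3) & 3;
    const int m =  v       & 7;
    // e == 0 is subnormal: m/8 * 2^(1-1); otherwise (1 + m/8) * 2^(e-1). Max = 7.5.
    const float f = e == 0 ? m / 8.0f : ldexpf(1.0f + m / 8.0f, e - 1);
    return s ? -f : f;
}

// Decode one E3M2 element: 1 sign, 3 exponent, 2 mantissa bits, exponent bias 3.
static float mxfp6_e3m2_to_float(uint8_t v) {
    const int s = (v >> 5) & 1;
    const int e = (v >> 2) & 7;
    const int m =  v       & 3;
    // e == 0 is subnormal: m/4 * 2^(1-3); otherwise (1 + m/4) * 2^(e-3). Max = 28.
    const float f = e == 0 ? ldexpf(m / 4.0f, -2) : ldexpf(1.0f + m / 4.0f, e - 3);
    return s ? -f : f;
}
```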

test-quantize-* passed in local CI.
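
(test-quantize-fns round-trips every type through quantize/dequantize and checks the error.) As a sketch of what row dequantization could look like for the layout above, building on the decoders sketched earlier; the 6-bit packing order here is my assumption, and the PR may pack differently:

```c
// Hypothetical: every 3 payload bytes hold four 6-bit values; the shared
// E8M0 scale becomes a single power-of-two multiply per block.
static void dequantize_row_mxfp6_e2m3(const block_mxfp6 * x, float * y, int64_t k) {
    const int64_t nb = k / QK_MXFP6;
    for (int64_t i = 0; i < nb; i++) {
        const float d = ldexpf(1.0f, (int) x[i].e - 127);  // 2^(e - 127)
        for (int j = 0; j < QK_MXFP6; j += 4) {
            const uint8_t * p = &x[i].qs[j / 4 * 3];
            const uint32_t bits = p[0] | (p[1] << 8) | ((uint32_t) p[2] << 16);
            for (int l = 0; l < 4; l++) {
                y[i*QK_MXFP6 + j + l] = d * mxfp6_e2m3_to_float((bits >> (6*l)) & 0x3F);
            }
        }
    }
}
```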

d:\llama.cpp> ./build_cpu/bin/Debug/test-quantize-fns.exe
....
Testing tq2_0
Testing mxfp4
Testing mxfp6_e3m2
Testing mxfp6_e2m3

d:\llama.cpp> ./build_cpu/bin/Debug/test-quantize-perf.exe
...
mxfp4
  quantize_row_q_reference
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      0.08 GB/s
      quantized throughput :      0.01 GB/s

  quantize_row_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      0.08 GB/s
      quantized throughput :      0.01 GB/s

  dequantize_row_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      2.18 GB/s
      quantized throughput :      0.29 GB/s

  quantize_row_q_dot
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      2.46 GB/s
      quantized throughput :      0.33 GB/s

  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      5.87 GB/s
      quantized throughput :      0.78 GB/s

mxfp6_e3m2
  quantize_row_q_reference
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      0.02 GB/s
      quantized throughput :      0.00 GB/s

  quantize_row_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      0.02 GB/s
      quantized throughput :      0.00 GB/s

  dequantize_row_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      2.03 GB/s
      quantized throughput :      0.40 GB/s

  quantize_row_q_dot
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      2.46 GB/s
      quantized throughput :      0.48 GB/s

  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      2.06 GB/s
      quantized throughput :      0.40 GB/s

mxfp6_e2m3
  quantize_row_q_reference
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      0.02 GB/s
      quantized throughput :      0.00 GB/s

  quantize_row_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      0.02 GB/s
      quantized throughput :      0.00 GB/s

  dequantize_row_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      2.03 GB/s
      quantized throughput :      0.40 GB/s

  quantize_row_q_dot
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      2.42 GB/s
      quantized throughput :      0.47 GB/s

  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      2.72 GB/s
      quantized throughput :      0.53 GB/s

      Start 29: test-gguf
29/35 Test #29: test-gguf .........................   Passed    0.15 sec
      Start 32: test-barrier
30/35 Test #32: test-barrier ......................   Passed    1.16 sec
      Start 33: test-quantize-fns
31/35 Test #33: test-quantize-fns .................   Passed   17.83 sec
      Start 34: test-quantize-perf
32/35 Test #34: test-quantize-perf ................   Passed    0.28 sec
      Start 35: test-rope
33/35 Test #35: test-rope .........................   Passed    0.06 sec
      Start 36: test-mtmd-c-api
34/35 Test #36: test-mtmd-c-api ...................   Passed    0.01 sec
      Start 37: test-alloc
35/35 Test #37: test-alloc ........................   Passed    0.00 sec

97% tests passed, 1 tests failed out of 35

Label Time Summary:
main    =  46.14 sec*proc (35 tests)

Total Test time (real) =  46.15 sec

The following tests FAILED:
	 14 - test-tokenizers-ggml-vocabs (Failed)
Errors while running CTest

real	0m46.164s
user	1m14.081s
sys	0m1.035s

test-tokenizers-ggml-vocabs reports a failure, but I don't think it's caused by this PR, as it doesn't touch the GGUF parser:

main : reading vocab from: 'models/ggml-vocabs/PLaMo2/ggml-vocab-plamo2.gguf'
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen 5 7600 6-Core Processor)
gguf_init_from_file_impl: invalid magic characters: 'vers', expected 'GGUF'
llama_model_load: error loading model: llama_model_loader: failed to load model from /models/ggml-vocabs/PLaMo2/ggml-vocab-plamo2.gguf

@github-actions github-actions bot added the Nvidia GPU, examples, python, and ggml labels Oct 26, 2025
@horasal horasal marked this pull request as draft October 26, 2025 08:33
CISC (Collaborator) commented Oct 26, 2025

test-tokenizers-ggml-vocabs reports a failure, but I don't think it's caused by this PR, as it doesn't touch the GGUF parser:

main : reading vocab from: 'models/ggml-vocabs/PLaMo2/ggml-vocab-plamo2.gguf'
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen 5 7600 6-Core Processor)
gguf_init_from_file_impl: invalid magic characters: 'vers', expected 'GGUF'
llama_model_load: error loading model: llama_model_loader: failed to load model from /models/ggml-vocabs/PLaMo2/ggml-vocab-plamo2.gguf

This happens because you have not installed/initialized git-lfs, so the vocab files were not downloaded properly.
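
(For reference, the 'vers' magic above is the start of a git-lfs pointer file, which begins with `version https://git-lfs.github.com/spec/v1`. The usual fix is the standard git-lfs dance:)

```
git lfs install
git lfs pull
```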

horasal (Author) commented Oct 27, 2025

installed/initialized git-lfs

Thanks for the advice.
I ran some tests quantizing an existing model to MXFP6, and it seems to work well now.
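
For anyone reproducing this, requantization goes through the stock llama-quantize tool; the exact MXFP6 type name this PR registers is an assumption on my part:

```
./build_cpu/bin/llama-quantize model-f16.gguf model-mxfp6.gguf MXFP6_E2M3
```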

@horasal horasal marked this pull request as ready for review October 27, 2025 00:31
CISC (Collaborator) commented Oct 27, 2025

Please restore formatting to constants.py and quants.py.

slaren (Member) commented Oct 28, 2025

What is the motivation for this type? Are there any models being natively distributed in MXFP6, or does it perform better than other quantizations?

jacekpoplawski (Contributor) commented

What is the motivation for this type? Are there any models being natively distributed in MXFP6, or does it perform better than other quantizations?

probably Blackwell support

horasal (Author) commented Oct 28, 2025

What is the motivation for this type? Are there any models being natively distributed in MXFP6, or does it perform better than other quantizations?

Currently, there are no models natively distributed in MXFP6, but I think MXFP6 may offer a good balance between model quality and performance in the future :)

NVIDIA's Blackwell architecture is expected to support MXFP6, and AMD's MI355X also includes MXFP6 support. Additionally, while MXFP4 has shown promising results with QAT, some papers (e.g., Tables 2 and 3 in this paper) report that MXFP4 may not perform as well under direct quantization, which is one of the common use cases for llama.cpp. In contrast, MXFP6 appears to be more robust in such settings.
