gguf : add findNearestQuantType #1421

Merged
merged 4 commits into huggingface:main on May 6, 2025

Conversation

@ngxson ngxson commented May 2, 2025

In this PR:

  • Move GGMLFileQuantizationType to tasks
  • Update the list of GGMLFileQuantizationType (NOTE: a file can contain multiple quants; for example, a Q4_K_M file is Q4_K + Q6_K)
  • For GGMLQuantizationType, add the TQ1_0 and TQ2_0 ternary quants
  • Add findNearestQuantType (see below)

findNearestQuantType

This function is useful for the /v2 registry endpoint with text + vision models, when we want to pick the corresponding vision model to pair with a text model.

The main issue is that a text model can go lower than Q4 (Q3/Q2/Q1), but that is not the case for the vision model, as vision models are quite sensitive to quantization.

On @bartowski1182's repos, most vision models have BF16, F16, and maybe Q8_0 versions. The idea is:

  • If the user picks a BF16/F16/Q8_0 text model, we pair it with the corresponding BF16/F16/Q8_0 vision model
  • If the user picks something else, like Q4_K_M, we find the nearest quant to pair with it (Q8_0 in this case)


// This function finds the nearest quantization type that is less than or equal to the given quantization type.
// It returns undefined if no such quantization type is found.
export function findNearestQuantType(

@ngxson ngxson May 2, 2025

full disclosure: this function was written by gemini 2.5 pro 😂
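
The excerpt above is truncated at the signature. A minimal sketch of logic that matches the behavior described in the PR and the test cases further down, assuming the GGUF_QUANT_ORDER list introduced later in this PR (ordered biggest to smallest); this is a sketch, not necessarily the merged implementation:

// Sketch only: assumes GGMLFileQuantizationType and GGUF_QUANT_ORDER
// (ordered biggest to smallest) are in scope, as introduced elsewhere in this PR.
export function findNearestQuantType(
	quant: GGMLFileQuantizationType,
	availableQuants: GGMLFileQuantizationType[]
): GGMLFileQuantizationType | undefined {
	// Rank each quant by its position in GGUF_QUANT_ORDER (0 = biggest).
	const rank = new Map(GGUF_QUANT_ORDER.map((q, i) => [q, i] as const));
	const target = rank.get(quant);
	// Keep only quants we know how to rank, sorted biggest to smallest.
	const candidates = availableQuants
		.filter((q) => rank.has(q))
		.sort((a, b) => (rank.get(a) as number) - (rank.get(b) as number));
	if (target === undefined || candidates.length === 0) {
		return undefined;
	}
	// Prefer the closest available quant that is equal to or smaller than the target.
	for (const q of candidates) {
		if ((rank.get(q) as number) >= target) {
			return q;
		}
	}
	// Otherwise fall back to the smallest available quant that is still bigger,
	// e.g. text = Q4_K_M with only [BF16, F16, Q8_0] available -> Q8_0.
	return candidates[candidates.length - 1];
}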

@bartowski1182

Is it necessarily true that a user may want to save ~400mb on the vision part (by going to Q8_0) if they choose a smaller quant?

Though I'm guessing this may be an optional flag?

Super interesting for the f16 vs bf16!

ngxson commented May 2, 2025

Is it necessarily true that a user may want to save ~400mb on the vision part (by going to Q8_0) if they choose a smaller quant?

Hmm, yeah, that's a valid question. I usually use Q8_0 when developing locally because the kernels for Q8_0 are significantly faster than those for F16.

Even without that, saving 400MB is also significant (Ofc this is quite subjective, but personally I think 400MB is big for a vision model 😂 )

Though I'm guessing this may be an optional flag?

This may not be needed, because if you think F16 vs Q8_0 doesn't differ much, you can simply skip producing Q8_0 for it.

For example, if you have a 400MB model in F16, converting to Q8_0 only saves 200MB, which is not worth it.

But let's see; if people think otherwise, we can iterate on this later on.

@bartowski1182

Fair enough!

Just thinking of people who might download multiple versions and want to test whether they see a difference; it would be good to have a CLI flag that lets you specify which specific mmproj you want to load.

I may also be mistaken and that's exactly what we can already do; I haven't opened the code for this PR yet and probably won't until I'm at a computer.

Comment on lines +315 to +327
it("should find the nearest quant (vision model)", () => {
const visionQuants = [GGMLFileQuantizationType.Q8_0, GGMLFileQuantizationType.F16, GGMLFileQuantizationType.BF16];
let nearestQuant;
// text = Q4_K_M
nearestQuant = findNearestQuantType(GGMLFileQuantizationType.Q4_K_M, visionQuants);
expect(nearestQuant).toEqual(GGMLFileQuantizationType.Q8_0);
// text = Q8_0
nearestQuant = findNearestQuantType(GGMLFileQuantizationType.Q8_0, visionQuants);
expect(nearestQuant).toEqual(GGMLFileQuantizationType.Q8_0);
// text = F16
nearestQuant = findNearestQuantType(GGMLFileQuantizationType.F16, visionQuants);
expect(nearestQuant).toEqual(GGMLFileQuantizationType.F16);
});
@ngxson ngxson (Member Author)

Btw @bartowski1182, this test case is inspired by a real-world scenario where the vision model is quantized to F16/BF16/Q8_0 and the text model can be anything else.

Feel free to suggest other test cases if you can think of any!
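
One extra case that might be worth covering, sketched here as a suggestion rather than part of the PR: per the doc comment, the function returns undefined when no suitable quant is found, for example when the list of available vision quants is empty.

it("should return undefined when no vision quants are available", () => {
	// Suggested case only (not in this PR): with nothing to choose from,
	// the doc comment implies the function returns undefined.
	const nearestQuant = findNearestQuantType(GGMLFileQuantizationType.Q4_K_M, []);
	expect(nearestQuant).toBeUndefined();
});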

ngxson commented May 2, 2025

Just thinking of people who might download multiple versions and want to test whether they see a difference; it would be good to have a CLI flag that lets you specify which specific mmproj you want to load.

If you use llama-mtmd-cli, -hf will respect --mmproj and --mmproj-url if you explicitly specify them. So if the user wants to try different combinations, they can download the mmproj files locally and load them via --mmproj.

Alternatively, we could add support for a "component" in the tag name, like model:Q4_K_M+vQ8_0, if the user wants to pair Q4 text with Q8 vision. But tbh this is quite confusing, and I think downloading it locally (as said above) is more intuitive.

// order of quantization, from biggest to smallest
// this list must be in sync with the order in GGMLFileQuantizationType
// the gguf.spec.ts tests are used to verify that the order is correct
export const GGUF_QUANT_ORDER: GGMLFileQuantizationType[] = [

@gary149 gary149 May 5, 2025

btw, I'm interested in improving the ordering in the quant selector here: https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF?local-app=llama.cpp

@ngxson ngxson (Member Author)

Yeah, that's a good idea. This list is already exported and ready to be used in the Hub UI. Can you think of any other improvements?
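
For the selector, a minimal sketch of how the exported order could drive the sorting; the helper name and import path below are assumptions, not part of this PR:

// Hypothetical helper, not part of this PR: sort the quant variants shown in a
// model page's quant selector from biggest to smallest using the exported order.
// The import path is an assumption.
import { GGUF_QUANT_ORDER, GGMLFileQuantizationType } from "@huggingface/gguf";

export function sortQuantsBySize(quants: GGMLFileQuantizationType[]): GGMLFileQuantizationType[] {
	const rank = new Map(GGUF_QUANT_ORDER.map((q, i) => [q, i] as const));
	// Unknown quants sort last so they never hide the known ones.
	return [...quants].sort((a, b) => (rank.get(a) ?? Infinity) - (rank.get(b) ?? Infinity));
}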

@gary149 gary149 (Collaborator)

No, I think we can start with it (maybe there are a few variants missing).

@ngxson ngxson (Member Author)

I already synced this list with the latest llama.cpp code, so it should be good.

ngxson commented May 6, 2025

I'm merging this PR later today (unless someone opposes).

@ngxson ngxson merged commit dea956e into huggingface:main May 6, 2025
4 checks passed