
mtmd : add qwen2vl and qwen2.5vl #13141


Merged · 10 commits into ggml-org:master · Apr 29, 2025
Conversation

ngxson (Collaborator) commented on Apr 27, 2025

Tested with:

llama-mtmd-cli -hf bartowski/Qwen2-VL-2B-Instruct-GGUF
llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF

NOTE: Qwen2-VL-2B seems to give poor results with llama-qwen2vl-cli, possibly due to incorrect n_past tracking; see the test result below. It now gives the correct answer with llama-mtmd-cli.

# WRONG result with llama-qwen2vl-cli
llama-qwen2vl-cli -hf bartowski/Qwen2-VL-2B-Instruct-GGUF --image ../models/lenna.png -p "How many people do you see in the image? What is the composition of this image?"
# I'm sorry, but I cannot see any image or any people in this context. Can you provide more details or clarify what you are asking?


# CORRECT result with llama-mtmd-cli
llama-mtmd-cli -hf bartowski/Qwen2-VL-2B-Instruct-GGUF --image ../models/lenna.png -p "How many people do you see in the image? What is the composition of this image?"
# There is one person in the image. The composition of the image is a close-up of the person wearing a hat.

llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF --image ../models/lenna.png -p "How many people do you see in the image? What is the composition of this image?"
# There is one person in the image. The composition of the image is a close-up of the person's face, focusing on their eyes and the hat they are wearing.

Test results:

OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
OK:   llama-mtmd-cli guinmoon/MobileVLM-3B-GGUF:Q4_K_M
OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K
OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
OK:   llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M

ngxson (Collaborator, Author) commented on Apr 28, 2025

@HimariO llama-qwen2vl-cli seems to give an incorrect response with Qwen2VL:

llama-qwen2vl-cli -hf bartowski/Qwen2-VL-2B-Instruct-GGUF --image ../models/lenna.png -p "How many people do you see in the image? What is the composition of this image?"
# I'm sorry, but I cannot see any image or any people in this context. Can you provide more details or clarify what you are asking?

I suspect this is due to st_pos_id being tracked incorrectly for images. As I understand it, one image corresponds to 1 temporal position instead of max(pw, ph):

*st_pos_id += std::max(pw, ph);

In the implementation introduced in this PR, I only advance n_past by 1 for each image, which now gives the correct result.

So I just want to double-check with you whether this is correct. Thanks!

See comment below
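
For illustration, here is a minimal, hypothetical sketch (not the actual llama.cpp code) contrasting the two position-advance strategies discussed above; the names pw, ph and st_pos_id follow the snippet, while the helper functions are made up:

#include <algorithm>

// Old behavior in llama-qwen2vl-cli: after an image, the running position
// jumps by the larger dimension of the patch grid.
static void advance_pos_old(int * st_pos_id, int pw, int ph) {
    *st_pos_id += std::max(pw, ph);
}

// Behavior described for llama-mtmd-cli in this PR: the whole image occupies
// a single temporal position, so the counter advances by exactly 1.
static void advance_pos_new(int * st_pos_id, int /*pw*/, int /*ph*/) {
    *st_pos_id += 1;
}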

@ngxson ngxson marked this pull request as ready for review April 28, 2025 14:52
@ngxson ngxson requested a review from ggerganov April 28, 2025 14:52
Comment on lines +30 to +31
// THIS FILE IS ONLY USED FOR TESTING THE QWEN2VL MODEL
// IT IS NOT A PRODUCTION CODE
ngxson (Collaborator, Author) commented on Apr 28, 2025:

@HimariO Please note that llama-qwen2vl-cli is no longer used; the target is deprecated in CMake. I'm keeping this file because it contains your code for testing M-RoPE.

ngxson (Collaborator, Author) commented on Apr 28, 2025

I suspect this is due to st_pos_id being tracked incorrectly for images. As I understand it, one image corresponds to 1 temporal position instead of max(pw, ph):

Small update: if I change the mentioned line in qwen2vl-cli.cpp to:

*st_pos_id += 1;

Then, Qwen2-VL-2B-Instruct works correctly.


Edit: it could be related to #13159

return clip_n_patches(ctx) * clip_n_mmproj_embd(ctx) * sizeof(float);
const int32_t nx = ctx->vision_model.hparams.image_size;
const int32_t ny = ctx->vision_model.hparams.image_size;
return clip_embd_nbytes_by_img(ctx, ny, nx);
}

size_t clip_embd_nbytes_by_img(const struct clip_ctx * ctx, int img_h, int img_w) {
A reviewer (Member) commented:

Rename img_h and img_w to nx and ny and fix calls:

clip_embd_nbytes_by_img(ctx, ny, nx); <-- wrong
clip_embd_nbytes_by_img(ctx, nx, ny);
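
For context, a hypothetical fragment (not the actual clip.cpp code) of what the suggested rename could look like, assuming the enclosing function in the diff above is clip_embd_nbytes:

// sketch only: parameters renamed to nx (width) and ny (height), and the
// call site passes the arguments in the same order as the signature
size_t clip_embd_nbytes_by_img(const struct clip_ctx * ctx, int nx, int ny);

size_t clip_embd_nbytes(const struct clip_ctx * ctx) {
    // the preset image size is square here, so nx == ny, but the argument
    // order should still match the signature
    const int32_t nx = ctx->vision_model.hparams.image_size;
    const int32_t ny = ctx->vision_model.hparams.image_size;
    return clip_embd_nbytes_by_img(ctx, nx, ny); // was (ctx, ny, nx)
}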

ngxson (Collaborator, Author) replied:

Fixed in 496f1ce

@@ -313,7 +313,7 @@ static bool encode_image_with_clip(clip_ctx * ctx_clip, int n_threads, const cli
image_embd + n_img_pos_out * clip_n_mmproj_embd(ctx_clip),
image_embd_v[i],
clip_embd_nbytes_by_img(ctx_clip, nx, ny));
The reviewer (Member) commented:

This is the other call.

ngxson (Collaborator, Author) commented on Apr 29, 2025

Looking at the source code of transformers, I think it's correct to have 1 position per image instead of std::max(pw, ph); otherwise it would make no sense to process video inputs. So I'm keeping my version as-is.

https://github.com/huggingface/transformers/blob/4602059aaee7b075ca51c21941ab2d27980bef7c/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py#L1460-L1471

This is also confirmed by the illustration in their paper:

[image: position-assignment illustration from the Qwen2-VL paper]

If someone knows more about this, feel free to correct me.
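
As a rough, self-contained illustration of this reading (a hypothetical sketch, not the actual implementation): all patches of one image share the same temporal component, the h/w components walk the patch grid from the same base position, and the running counter then advances by 1:

#include <cstdio>

int main() {
    int st_pos_id = 5; // running position when the image starts (example value)
    int ph = 2;        // patch grid height (example value)
    int pw = 3;        // patch grid width  (example value)

    for (int y = 0; y < ph; y++) {
        for (int x = 0; x < pw; x++) {
            // every patch shares the temporal component st_pos_id
            printf("patch(%d,%d) -> (t=%d, h=%d, w=%d)\n",
                   y, x, st_pos_id, st_pos_id + y, st_pos_id + x);
        }
    }

    // per this PR, the counter advances by 1 for the whole image,
    // not by std::max(pw, ph) as in the old llama-qwen2vl-cli code
    st_pos_id += 1;
    printf("next position: %d\n", st_pos_id);
    return 0;
}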

ngxson (Collaborator, Author) commented on Apr 29, 2025

Running the tests again, everything is good, so I'm merging this PR once the CI is green:

OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
OK:   llama-mtmd-cli guinmoon/MobileVLM-3B-GGUF:Q4_K_M
OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K
OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
OK:   llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
llama-mtmd-cli -hf bartowski/Qwen2-VL-2B-Instruct-GGUF --image ../models/lenna.png -p "How many people do you see in the image? What is the composition of this image?"
# There is one person in the image. The composition of the image is a close-up of the person wearing a hat.

llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF --image ../models/lenna.png -p "How many people do you see in the image? What is the composition of this image?"
# There is one person in the image. The composition of the image is a close-up of the woman's face, focusing on her eyes and the hat she is wearing.

ngxson merged commit 00e3e5a into ggml-org:master on Apr 29, 2025
51 of 52 checks passed