
mtmd : add qwen2vl and qwen2.5vl #13141


Merged · 10 commits into ggml-org:master · Apr 29, 2025
Conversation

ngxson (Collaborator) commented on Apr 27, 2025

Tested with:

llama-mtmd-cli -hf bartowski/Qwen2-VL-2B-Instruct-GGUF
llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF

NOTE: Qwen2-VL-2B seems to give poor results with llama-qwen2vl-cli, possibly due to incorrect n_past tracking; see the test result below. It now gives the correct answer with llama-mtmd-cli.

# WRONG result with llama-qwen2vl-cli
llama-qwen2vl-cli -hf bartowski/Qwen2-VL-2B-Instruct-GGUF --image ../models/lenna.png -p "How many people do you see in the image? What is the composition of this image?"
# I'm sorry, but I cannot see any image or any people in this context. Can you provide more details or clarify what you are asking?


# CORRECT result with llama-mtmd-cli
llama-mtmd-cli -hf bartowski/Qwen2-VL-2B-Instruct-GGUF --image ../models/lenna.png -p "How many people do you see in the image? What is the composition of this image?"
# There is one person in the image. The composition of the image is a close-up of the person wearing a hat.

llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF --image ../models/lenna.png -p "How many people do you see in the image? What is the composition of this image?"
# There is one person in the image. The composition of the image is a close-up of the person's face, focusing on their eyes and the hat they are wearing.

Test results:

OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
OK:   llama-mtmd-cli guinmoon/MobileVLM-3B-GGUF:Q4_K_M
OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K
OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
OK:   llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M

ngxson (Collaborator, Author) commented on Apr 28, 2025

@HimariO llama-qwen2vl-cli seems to give an incorrect response with Qwen2VL:

llama-qwen2vl-cli -hf bartowski/Qwen2-VL-2B-Instruct-GGUF --image ../models/lenna.png -p "How many people do you see in the image? What is the composition of this image?"
# I'm sorry, but I cannot see any image or any people in this context. Can you provide more details or clarify what you are asking?

I suspect this is due to st_pos_id being tracked incorrectly for images. As I understand it, one image corresponds to 1 temporal position instead of max(pw, ph):

*st_pos_id += std::max(pw, ph);

In the implementation introduced in this PR, I only advance n_past by 1 for each image, which now gives the correct result.

So I just want to double-check with you whether this is correct. Thanks!

See comment below
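
For illustration, here is a minimal, hypothetical sketch (not the actual llama.cpp code) contrasting the two position-advance strategies discussed above; the names pw, ph and st_pos_id follow the snippet, while the helper functions are made up:

#include <algorithm>

// Old behavior in llama-qwen2vl-cli: after an image, the running position
// jumps by the larger dimension of the patch grid.
static void advance_pos_old(int * st_pos_id, int pw, int ph) {
    *st_pos_id += std::max(pw, ph);
}

// Behavior described for llama-mtmd-cli in this PR: the whole image occupies
// a single temporal position, so the counter advances by exactly 1.
static void advance_pos_new(int * st_pos_id, int /*pw*/, int /*ph*/) {
    *st_pos_id += 1;
}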

@ngxson ngxson marked this pull request as ready for review April 28, 2025 14:52
@ngxson ngxson requested a review from ggerganov April 28, 2025 14:52
Comment on lines +30 to +31
// THIS FILE IS ONLY USED FOR TESTING THE QWEN2VL MODEL
// IT IS NOT A PRODUCTION CODE
ngxson (Collaborator, Author) commented on Apr 28, 2025:

@HimariO Please note that llama-qwen2vl-cli is no longer used; the target is deprecated in CMake. I'm keeping this file because it contains your code for testing M-RoPE.

ngxson (Collaborator, Author) commented on Apr 28, 2025

I suspect this is due to st_pos_id being tracked incorrectly for images. As I understand it, one image corresponds to 1 temporal position instead of max(pw, ph):

Small update: if I change the mentioned line in qwen2vl-cli.cpp to:

*st_pos_id += 1;

Then, Qwen2-VL-2B-Instruct works correctly.


Edit: it could be related to #13159

return clip_n_patches(ctx) * clip_n_mmproj_embd(ctx) * sizeof(float);
const int32_t nx = ctx->vision_model.hparams.image_size;
const int32_t ny = ctx->vision_model.hparams.image_size;
return clip_embd_nbytes_by_img(ctx, ny, nx);
}

size_t clip_embd_nbytes_by_img(const struct clip_ctx * ctx, int img_h, int img_w) {
A reviewer (Member) commented:

Rename img_h and img_w to nx and ny and fix calls:

clip_embd_nbytes_by_img(ctx, ny, nx); <-- wrong
clip_embd_nbytes_by_img(ctx, nx, ny);
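
For context, a hypothetical fragment (not the actual clip.cpp code) of what the suggested rename could look like, assuming the enclosing function in the diff above is clip_embd_nbytes:

// sketch only: parameters renamed to nx (width) and ny (height), and the
// call site passes the arguments in the same order as the signature
size_t clip_embd_nbytes_by_img(const struct clip_ctx * ctx, int nx, int ny);

size_t clip_embd_nbytes(const struct clip_ctx * ctx) {
    // the preset image size is square here, so nx == ny, but the argument
    // order should still match the signature
    const int32_t nx = ctx->vision_model.hparams.image_size;
    const int32_t ny = ctx->vision_model.hparams.image_size;
    return clip_embd_nbytes_by_img(ctx, nx, ny); // was (ctx, ny, nx)
}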

ngxson (Collaborator, Author) replied:

Fixed in 496f1ce

@@ -313,7 +313,7 @@ static bool encode_image_with_clip(clip_ctx * ctx_clip, int n_threads, const cli
image_embd + n_img_pos_out * clip_n_mmproj_embd(ctx_clip),
image_embd_v[i],
clip_embd_nbytes_by_img(ctx_clip, nx, ny));
The reviewer (Member) commented:

This is the other call.

ngxson (Collaborator, Author) commented on Apr 29, 2025

Looking at the source code of transformers, I think it's correct to have 1 position per image instead of std::max(pw, ph); otherwise it would make no sense to process video inputs. So I'm keeping my version as-is.

https://github.com/huggingface/transformers/blob/4602059aaee7b075ca51c21941ab2d27980bef7c/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py#L1460-L1471

This is also confirmed by the illustration in their paper:

[image: position-assignment illustration from the Qwen2-VL paper]

If someone knows more about this, feel free to correct me.
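
As a rough, self-contained illustration of this reading (a hypothetical sketch, not the actual implementation): all patches of one image share the same temporal component, the h/w components walk the patch grid from the same base position, and the running counter then advances by 1:

#include <cstdio>

int main() {
    int st_pos_id = 5; // running position when the image starts (example value)
    int ph = 2;        // patch grid height (example value)
    int pw = 3;        // patch grid width  (example value)

    for (int y = 0; y < ph; y++) {
        for (int x = 0; x < pw; x++) {
            // every patch shares the temporal component st_pos_id
            printf("patch(%d,%d) -> (t=%d, h=%d, w=%d)\n",
                   y, x, st_pos_id, st_pos_id + y, st_pos_id + x);
        }
    }

    // per this PR, the counter advances by 1 for the whole image,
    // not by std::max(pw, ph) as in the old llama-qwen2vl-cli code
    st_pos_id += 1;
    printf("next position: %d\n", st_pos_id);
    return 0;
}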

ngxson (Collaborator, Author) commented on Apr 29, 2025

Running the tests again, everything is good, so I'm merging this PR once the CI is green:

OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
OK:   llama-mtmd-cli guinmoon/MobileVLM-3B-GGUF:Q4_K_M
OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K
OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
OK:   llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
llama-mtmd-cli -hf bartowski/Qwen2-VL-2B-Instruct-GGUF --image ../models/lenna.png -p "How many people do you see in the image? What is the composition of this image?"
# There is one person in the image. The composition of the image is a close-up of the person wearing a hat.

llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF --image ../models/lenna.png -p "How many people do you see in the image? What is the composition of this image?"
# There is one person in the image. The composition of the image is a close-up of the woman's face, focusing on her eyes and the hat she is wearing.

ngxson merged commit 00e3e5a into ggml-org:master on Apr 29, 2025
51 of 52 checks passed