mtmd : add qwen2vl and qwen2.5vl #13141
@HimariO
I suspect this is due to the llama.cpp/examples/llava/qwen2vl-cli.cpp Line 53 in 4e87962
In the implementation introduced in this PR, I only advance
See comment below
// THIS FILE IS ONLY USED FOR TESTING THE QWEN2VL MODEL
// IT IS NOT A PRODUCTION CODE
@HimariO Please note that llama-qwen2vl-cli is no longer used; the target is deprecated in cmake. I still keep this file because it contains your code for testing M-RoPE.
Small update, if I change the mentioned line in
Then, Edit: it could be related to #13159
examples/llava/clip.cpp
Outdated
    return clip_n_patches(ctx) * clip_n_mmproj_embd(ctx) * sizeof(float);
    const int32_t nx = ctx->vision_model.hparams.image_size;
    const int32_t ny = ctx->vision_model.hparams.image_size;
    return clip_embd_nbytes_by_img(ctx, ny, nx);
}

size_t clip_embd_nbytes_by_img(const struct clip_ctx * ctx, int img_h, int img_w) {
Rename img_h and img_w to nx and ny and fix calls:
clip_embd_nbytes_by_img(ctx, ny, nx); <-- wrong
clip_embd_nbytes_by_img(ctx, nx, ny);
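To illustrate the width-first `(nx, ny)` convention requested here, below is a minimal sketch. It is not the llama.cpp implementation: `sketch_hparams`, `patches_along`, and `sketch_embd_nbytes_by_img` are hypothetical names, and the patch-count formula (one embedding per square patch, rounding up at the edges) is an assumption made for illustration only.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the model hyperparameters; not the real clip_ctx.
struct sketch_hparams {
    int patch_size; // side length of one square patch, in pixels (assumed)
    int n_embd;     // number of floats per output embedding (assumed)
};

// Number of patches along one axis, rounding up so a partial patch still counts.
static int patches_along(int n_pixels, int patch_size) {
    return (n_pixels + patch_size - 1) / patch_size;
}

// Size in bytes of the embedding buffer for an image of nx by ny pixels.
// Arguments are width-first (nx, ny), matching the call order asked for in review.
static size_t sketch_embd_nbytes_by_img(const sketch_hparams & hp, int nx, int ny) {
    assert(nx > 0 && ny > 0 && hp.patch_size > 0);
    const size_t n_patches =
        (size_t) patches_along(nx, hp.patch_size) * patches_along(ny, hp.patch_size);
    return n_patches * hp.n_embd * sizeof(float);
}
```

Under these assumptions the byte count is a product of per-axis patch counts, so it is symmetric in `nx` and `ny`, and a swapped call would return the same size. The consistent ordering still matters for readability and for any later code that depends on orientation.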
Fixed in 496f1ce
@@ -313,7 +313,7 @@ static bool encode_image_with_clip(clip_ctx * ctx_clip, int n_threads, const cli
    image_embd + n_img_pos_out * clip_n_mmproj_embd(ctx_clip),
    image_embd_v[i],
    clip_embd_nbytes_by_img(ctx_clip, nx, ny));
This is the other call.
Ran the tests again and all is good, so I'll merge this PR once the CI is green.
llama-mtmd-cli -hf bartowski/Qwen2-VL-2B-Instruct-GGUF --image ../models/lenna.png -p "How many people do you see in the image? What is the composition of this image?"
# There is one person in the image. The composition of the image is a close-up of the person wearing a hat.
llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF --image ../models/lenna.png -p "How many people do you see in the image? What is the composition of this image?"
# There is one person in the image. The composition of the image is a close-up of the woman's face, focusing on her eyes and the hat she is wearing.
Tested with:
NOTE: Qwen2-VL-2B seems to give poor results on llama-qwen2vl-cli, which could be due to incorrect n_past tracking (see test results below); it now gives the correct answer in llama-mtmd-cli.
Test results: