Add Gemma3n multimodal support with MobileNetV5 vision encoder #18256
base: master
Conversation
…ert_hf_to_gguf.py. Add gemma3n to vision projectors in gguf-py/gguf/constants.py.
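For context, registering a new vision projector in gguf-py/gguf/constants.py generally amounts to adding one string constant. A minimal sketch, assuming the class name and the existing entry follow the current layout of that file:

```python
# gguf-py/gguf/constants.py (sketch; class name and existing entries are
# assumptions based on the file's current layout, not a verbatim diff)
class VisionProjectorType:
    GEMMA3 = "gemma3"
    GEMMA3N = "gemma3n"  # new entry for the MobileNetV5-based projector
```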
Before I can do my review, can you explicitly disclose any AI usage in this PR? This is a requirement in our contribution guide: https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md
@ngxson Is this worth it, re your stance here? #17961 (comment)
@CISC I stopped working on the other PR because:
however, if the current PR doesn't add much complexity (or the most complex parts are isolated in their own space) - which seems to be the case here - it's probably worth reviewing/merging this to unblock use cases while waiting for Google to release the next vision model. if it's still MobileNet, we will optimize the implementation; otherwise we leave it as-is. so my conclusion: it's still worth reviewing this PR, but it doesn't need to be too optimized
Hi @ngxson, I have updated the PR with AI disclosure.
2. Use available tensor mapping logic (see the sketch below)
3. Remove redundant chat template replacement of the soft-token placeholders with the media placeholder
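On the tensor-mapping point, a hedged sketch of what reusing the shared mapping logic typically looks like in convert_hf_to_gguf.py; the `map_tensor_name` helper and the `modify_tensors` hook are assumptions based on how other model classes in that script are written:

```python
# Sketch: inside a model class in convert_hf_to_gguf.py, reuse the shared
# name-mapping helper instead of hand-rolled renames.
def modify_tensors(self, data_torch, name, bid):
    # map_tensor_name() resolves the HF tensor name via gguf-py's
    # TensorNameMap and raises on unknown tensors, so new names must be
    # registered in the mapping tables rather than special-cased here.
    return [(self.map_tensor_name(name), data_torch)]
```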
…struct and definitions to mobilenetv5.cpp
2. Remove unused `clip_is_gemma3n` func declarations and definitions
3. Remove redundant `rescale_image_u8_to_f32` func and use `normalize_image_u8_to_f32` with zero mean and unit std (see the sketch below)
4. Calculate n_patches using image_size / patch_size
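Points 3 and 4 are easy to see in miniature. Python is used here purely for illustration; the actual code is C++ under tools/mtmd/, and the function names below are stand-ins:

```python
def normalize_u8(pixel: int, mean: float = 0.0, std: float = 1.0) -> float:
    # With zero mean and unit std this reduces to a plain rescale to [0, 1],
    # which is why a separate rescale_image_u8_to_f32 was redundant.
    return (pixel / 255.0 - mean) / std

def n_patches(image_size: int, patch_size: int) -> int:
    # One output token per patch along each spatial axis.
    return (image_size // patch_size) ** 2
```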
I’ve addressed all comments in the latest push and replied briefly inline with commit references. Requesting re-review when you have time.
This PR implements multimodal (image) support for the Gemma3n model, which uses the MobileNetV5 architecture as its vision encoder (instead of SigLIP used in Gemma3).
Related issues
Partially addresses #14429
Architecture Implementation
MobileNetV5 Vision Encoder:
Key Changes to existing code
- `convert_hf_to_gguf.py`: replace `<image_soft_token>` and `<audio_soft_token>` with `<__media__>`, the default marker (see the sketch after this list)
- `src/models/gemma3n-iswa.cpp`
- Relevant changes to files under `/tools/mtmd/` to add the MobileNetV5 vision encoder
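A minimal sketch of the placeholder rewrite in the first bullet; `text` is an illustrative stand-in, and exactly where this rewrite sits inside the converter class is simplified here:

```python
# Sketch: map Gemma3n's modality soft tokens to mtmd's default
# <__media__> marker so the runtime sees a single media placeholder.
for tok in ("<image_soft_token>", "<audio_soft_token>"):
    text = text.replace(tok, "<__media__>")
```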
Testing
Tested using the mtmd CLI:
```sh
./llama-mtmd-cli \
  -m gemma3n_e2b-it.gguf \
  --mmproj mobilenetv5_e2b-it.gguf \
  --no-mmproj-offload \
  -fa off \
  -p "Describe this image" \
  --image image.png
```

image.png:

Output:
AI Usage Disclosure
Claude Code was used to explore the existing codebase, create boilerplate, draft initial versions of functions and classes, and assist with debugging and testing. Ultimately, the code has undergone heavy manual editing.