
Conversation

@simrnsingh

@simrnsingh simrnsingh commented Dec 21, 2025

This PR implements multimodal (image) support for the Gemma3n model, which uses the MobileNetV5 architecture as its vision encoder (instead of SigLIP used in Gemma3).

Related issues

Partially addresses #14429

Architecture Implementation

MobileNetV5 Vision Encoder:

  • Stem convolution layer with RMSNorm2d and GELU activation
  • Edge Residual blocks with expansion and pointwise linear convolution phases
  • Universal Inverted Residual blocks with expansion, depthwise convolution, and projection phases (sketched after this list)
  • MobileNet Attention blocks (Multi-Query Attention, MQA) within the CNN architecture
  • Multi-Scale Fusion Adapter (MSFA) for combining features at different resolutions
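For orientation, below is a minimal ggml-style sketch of one Universal Inverted Residual block, assuming the usual expand → depthwise → project layout with a channel-wise RMSNorm in front. The helper names (rms_norm_2d, uir_block), the tensor layout, the "same" padding, and the use of ggml_conv_2d_dw are illustrative assumptions, not the PR's actual code:

#include "ggml.h"

// RMSNorm2d sketch: normalize across channels; ggml_rms_norm works along
// dim 0, so move channels there first (simple, unoptimized formulation)
static ggml_tensor * rms_norm_2d(ggml_context * ctx, ggml_tensor * x,
                                 ggml_tensor * gamma, float eps) {
    // [W, H, C, N] -> [C, W, H, N]
    ggml_tensor * t = ggml_cont(ctx, ggml_permute(ctx, x, 1, 2, 0, 3));
    t = ggml_rms_norm(ctx, t, eps);
    t = ggml_mul(ctx, t, gamma);  // per-channel scale, broadcast over W, H, N
    // back to [W, H, C, N]
    return ggml_cont(ctx, ggml_permute(ctx, t, 2, 0, 1, 3));
}

// One Universal Inverted Residual block: 1x1 expand -> KxK depthwise ->
// 1x1 project, with a residual add (assumes matching in/out shapes)
static ggml_tensor * uir_block(ggml_context * ctx, ggml_tensor * x,
                               ggml_tensor * w_exp, ggml_tensor * w_dw,
                               ggml_tensor * w_proj, ggml_tensor * gamma) {
    ggml_tensor * cur = rms_norm_2d(ctx, x, gamma, 1e-6f);  // eps is an assumption

    // expansion phase: 1x1 conv widens the channel dimension
    cur = ggml_conv_2d(ctx, w_exp, cur, 1, 1, 0, 0, 1, 1);
    cur = ggml_gelu(ctx, cur);

    // depthwise KxK conv, stride 1, "same" padding
    cur = ggml_conv_2d_dw(ctx, w_dw, cur, 1, 1,
                          (int) (w_dw->ne[0]/2), (int) (w_dw->ne[1]/2), 1, 1);
    cur = ggml_gelu(ctx, cur);

    // projection phase: 1x1 conv back down to the output channel count
    cur = ggml_conv_2d(ctx, w_proj, cur, 1, 1, 0, 0, 1, 1);

    return ggml_add(ctx, cur, x);  // residual connection
}

The permute/cont round-trip is just the naive way to get channel-wise normalization out of ggml_rms_norm (which normalizes along dim 0); a real implementation would likely avoid the extra copies.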

Key changes to existing code

  • convert_hf_to_gguf.py:
    • Add Gemma3nVisionModel
    • Pad the token embeddings in Gemma3NModel to the full vocab size so the special (vision) tokens are covered
    • Fix the chat template in Gemma3NModel by replacing <image_soft_token> and <audio_soft_token> with the default marker <__media__>
  • src/models/gemma3n-iswa.cpp:
    • Add emb input support for vision embeddings inside get_per_layer_inputs (see the sketch after this list)
  • Relevant changes to files under /tools/mtmd/ to add the mobilenetv5 vision encoder
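In spirit, the gemma3n-iswa.cpp change lets the text graph consume pre-computed vision embeddings instead of always doing a token-id lookup. A minimal sketch of that dispatch, with hypothetical names (build_inp_embd and its parameters) rather than the actual get_per_layer_inputs code:

#include "ggml.h"
#include <cstddef>

// Hypothetical helper: pick between a token-id lookup and externally
// supplied (vision) embeddings that mtmd has already projected into the
// text embedding space
static ggml_tensor * build_inp_embd(ggml_context * ctx,
                                    ggml_tensor  * tok_embd,    // [n_embd, n_vocab] embedding table
                                    ggml_tensor  * inp_tokens,  // [n_tokens] token ids, or NULL
                                    ggml_tensor  * inp_embd) {  // [n_embd, n_tokens] embeddings, or NULL
    if (inp_embd != NULL) {
        // multimodal path: use the vision embeddings as-is
        return inp_embd;
    }
    // text path: ordinary embedding table lookup
    return ggml_get_rows(ctx, tok_embd, inp_tokens);
}

In the PR itself the equivalent handling lives inside get_per_layer_inputs, as noted in the list above.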

Testing

Tested using the mtmd CLI:

./llama-mtmd-cli \
  -m gemma3n_e2b-it.gguf \
  --mmproj mobilenetv5_e2b-it.gguf \
  --no-mmproj-offload \
  -fa off \
  -p "Describe this image" \
  --image image.png

image.png:
(embedded image)

Output:

Captured from a slightly low angle, the image showcases a dynamic moment between a sleek black cat and a small, white mouse on a worn, blue wooden floor. The cat is in mid-leap, its front paws extended forward and its body angled towards the right, suggesting a pounce. Its tail is held high and curved, adding to the sense of motion. The cat's eyes are focused intently on the mouse.

The mouse is positioned to the right of the cat, appearing to dart away. It's small and white, contrasting sharply with the dark cat. The mouse is slightly blurred, emphasizing the cat's speed and focus.

The setting appears to be a child's bedroom. In the background, there's a bed with white bedding, a window letting in bright natural light, and various toys scattered around – a red ball, a yellow and orange toy, and a cardboard box with a cutout. A wooden chair and a dark object (possibly a speaker or a piece of furniture) are visible on the right side of the frame. The walls are a light beige color.

The lighting is bright, likely from the window, casting long shadows of both the cat and the mouse across the floor. The floorboards show signs of wear and tear, with visible cracks and discoloration. The overall mood is playful and captures a classic predator-prey interaction.

AI Usage Disclosure

Claude Code was used to explore the existing codebase, create boilerplate and initial drafts of functions and classes, and to help with debugging and testing. Ultimately, the code has undergone heavy manual editing.

@github-actions github-actions bot added the model (Model specific), examples, and python (python script changes) labels Dec 21, 2025
@ngxson
Collaborator

ngxson commented Dec 21, 2025

Before I can do my review, can you explicitly disclose any AI usage in this PR?

This is a requirement in our contribution guide: https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md

@CISC
Collaborator

CISC commented Dec 21, 2025

@ngxson Is this worth it, re your stance here? #17961 (comment)

@ngxson
Collaborator

ngxson commented Dec 21, 2025

@CISC I stopped working on the other PR because:

  • I know that mobilenetv5 will not perform well in ggml due to its convolutional nature
  • there is no guarantee that they will reuse this vision architecture in their next model. As we've already seen, both EmbeddingGemma and FunctionGemma use the gemma3 text arch instead of the gemma3n arch, so my speculation is that they will probably drop the gemma3n and mobilenet archs altogether

However, if the current PR doesn't add much complexity (or the most complex parts are isolated in their own space) - which seems to be the case here - it's probably worth reviewing/merging to unblock use cases while we wait for Google to release their next vision model. If it's still mobilenet, we will optimize the implementation; otherwise we leave it as-is.

So my conclusion: it's still worth reviewing this PR, but it doesn't need to be too optimized.

@simrnsingh
Author

Before I can do my review, can you explicitly disclose any AI usage in this PR?

This is a requirement in our contribution guide: https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md

Hi @ngxson, I have updated the PR with AI disclosure.
Best,
Simranjeet

Commits added to the PR (first lines truncated in the timeline):

2. Use available tensor mapping logic
3. Remove redundant chat template replacement of soft-token placeholder with media placeholder

…struct and definitions to mobilenetv5.cpp
2. Remove unused `clip_is_gemma3n` func declarations and definitions
3. Remove redundant `rescale_image_u8_to_f32` func and use `normalize_image_u8_to_f32` with zero mean and unit std
4. Calculate n_patches using image_size / patch_size
@simrnsingh
Author

I’ve addressed all comments in the latest push and replied briefly inline with commit references. Requesting re-review when you have time.

@simrnsingh simrnsingh requested a review from ngxson December 21, 2025 19:41