
Conversation

@simrnsingh

@simrnsingh simrnsingh commented Dec 21, 2025

This PR implements multimodal (image) support for the Gemma3n model, which uses the MobileNetV5 architecture as its vision encoder (instead of SigLIP used in Gemma3).

Related issues

Partially addresses #14429

Architecture Implementation

MobileNetV5 Vision Encoder:

  • Stem convolution layer with RMSNorm2d and GELU activation
  • Edge Residual blocks with expansion and pointwise linear convolution phases
  • Universal Inverted Residual blocks with expansion, depthwise convolution, and projection phases (sketched after this list)
  • MobileNet Attention blocks (Multi-Query Attention, MQA) within the CNN architecture
  • Multi-Scale Fusion Adapter (MSFA) for combining features at different resolutions
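For orientation, below is a minimal ggml-style sketch of one Universal Inverted Residual block, assuming the usual expand → depthwise → project layout with a channel-wise RMSNorm in front. The helper names (rms_norm_2d, uir_block), the tensor layout, the "same" padding, and the use of ggml_conv_2d_dw are illustrative assumptions, not the PR's actual code:

#include "ggml.h"

// RMSNorm2d sketch: normalize across channels; ggml_rms_norm works along
// dim 0, so move channels there first (simple, unoptimized formulation)
static ggml_tensor * rms_norm_2d(ggml_context * ctx, ggml_tensor * x,
                                 ggml_tensor * gamma, float eps) {
    // [W, H, C, N] -> [C, W, H, N]
    ggml_tensor * t = ggml_cont(ctx, ggml_permute(ctx, x, 1, 2, 0, 3));
    t = ggml_rms_norm(ctx, t, eps);
    t = ggml_mul(ctx, t, gamma);  // per-channel scale, broadcast over W, H, N
    // back to [W, H, C, N]
    return ggml_cont(ctx, ggml_permute(ctx, t, 2, 0, 1, 3));
}

// One Universal Inverted Residual block: 1x1 expand -> KxK depthwise ->
// 1x1 project, with a residual add (assumes matching in/out shapes)
static ggml_tensor * uir_block(ggml_context * ctx, ggml_tensor * x,
                               ggml_tensor * w_exp, ggml_tensor * w_dw,
                               ggml_tensor * w_proj, ggml_tensor * gamma) {
    ggml_tensor * cur = rms_norm_2d(ctx, x, gamma, 1e-6f);  // eps is an assumption

    // expansion phase: 1x1 conv widens the channel dimension
    cur = ggml_conv_2d(ctx, w_exp, cur, 1, 1, 0, 0, 1, 1);
    cur = ggml_gelu(ctx, cur);

    // depthwise KxK conv, stride 1, "same" padding
    cur = ggml_conv_2d_dw(ctx, w_dw, cur, 1, 1,
                          (int) (w_dw->ne[0]/2), (int) (w_dw->ne[1]/2), 1, 1);
    cur = ggml_gelu(ctx, cur);

    // projection phase: 1x1 conv back down to the output channel count
    cur = ggml_conv_2d(ctx, w_proj, cur, 1, 1, 0, 0, 1, 1);

    return ggml_add(ctx, cur, x);  // residual connection
}

The permute/cont round-trip is just the naive way to get channel-wise normalization out of ggml_rms_norm (which normalizes along dim 0); a real implementation would likely avoid the extra copies.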

Key changes to existing code

  • convert_hf_to_gguf.py:
    • Add Gemma3nVisionModel
    • Pad the token embeddings in Gemma3NModel to the full vocab size so the special (vision) tokens are covered
    • Fix the chat template in Gemma3NModel by replacing <image_soft_token> and <audio_soft_token> with the default marker <__media__>
  • src/models/gemma3n-iswa.cpp:
    • Add emb input support for vision embeddings inside get_per_layer_inputs (see the sketch after this list)
  • Relevant changes to files under /tools/mtmd/ to add the mobilenetv5 vision encoder
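In spirit, the gemma3n-iswa.cpp change lets the text graph consume pre-computed vision embeddings instead of always doing a token-id lookup. A minimal sketch of that dispatch, with hypothetical names (build_inp_embd and its parameters) rather than the actual get_per_layer_inputs code:

#include "ggml.h"
#include <cstddef>

// Hypothetical helper: pick between a token-id lookup and externally
// supplied (vision) embeddings that mtmd has already projected into the
// text embedding space
static ggml_tensor * build_inp_embd(ggml_context * ctx,
                                    ggml_tensor  * tok_embd,    // [n_embd, n_vocab] embedding table
                                    ggml_tensor  * inp_tokens,  // [n_tokens] token ids, or NULL
                                    ggml_tensor  * inp_embd) {  // [n_embd, n_tokens] embeddings, or NULL
    if (inp_embd != NULL) {
        // multimodal path: use the vision embeddings as-is
        return inp_embd;
    }
    // text path: ordinary embedding table lookup
    return ggml_get_rows(ctx, tok_embd, inp_tokens);
}

In the PR itself the equivalent handling lives inside get_per_layer_inputs, as noted in the list above.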

Testing

Tested using the mtmd CLI:

./llama-mtmd-cli \
  -m gemma3n_e2b-it.gguf \
  --mmproj mobilenetv5_e2b-it.gguf \
  --no-mmproj-offload \
  -fa off \
  -p "Describe this image" \
  --image image.png

image.png:
(embedded image)

Output:

Captured from a slightly low angle, the image showcases a dynamic moment between a sleek black cat and a small, white mouse on a worn, blue wooden floor. The cat is in mid-leap, its front paws extended forward and its body angled towards the right, suggesting a pounce. Its tail is held high and curved, adding to the sense of motion. The cat's eyes are focused intently on the mouse.

The mouse is positioned to the right of the cat, appearing to dart away. It's small and white, contrasting sharply with the dark cat. The mouse is slightly blurred, emphasizing the cat's speed and focus.

The setting appears to be a child's bedroom. In the background, there's a bed with white bedding, a window letting in bright natural light, and various toys scattered around – a red ball, a yellow and orange toy, and a cardboard box with a cutout. A wooden chair and a dark object (possibly a speaker or a piece of furniture) are visible on the right side of the frame. The walls are a light beige color.

The lighting is bright, likely from the window, casting long shadows of both the cat and the mouse across the floor. The floorboards show signs of wear and tear, with visible cracks and discoloration. The overall mood is playful and captures a classic predator-prey interaction.

AI Usage Disclosure

Claude Code was used to explore the existing codebase, create boilerplate and initial drafts of functions and classes, and to help with debugging and testing. Ultimately, the code has undergone heavy manual editing.

@github-actions github-actions bot added the model (Model specific), examples, and python (python script changes) labels Dec 21, 2025
@ngxson
Collaborator

ngxson commented Dec 21, 2025

Before I can do my review, can you explicitly disclose any AI usage in this PR?

This is a requirement in our contribution guide: https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md

@CISC
Collaborator

CISC commented Dec 21, 2025

@ngxson Is this worth it, re your stance here? #17961 (comment)

@ngxson
Collaborator

ngxson commented Dec 21, 2025

@CISC I stopped working on the other PR because:

  • I know that mobilenetv5 will not perform well in ggml due to its convolutional nature
  • there is no guarantee that they will reuse this vision architecture in their next model. As we've already seen, both EmbeddingGemma and FunctionGemma use the gemma3 text arch instead of the gemma3n arch, so my speculation is that they will probably drop the gemma3n and mobilenet archs altogether

However, if the current PR doesn't add much complexity (or the most complex parts are isolated in their own space) - which seems to be the case here - it's probably worth reviewing/merging to unblock use cases while we wait for Google to release their next vision model. If it's still mobilenet, we will optimize the implementation; otherwise we leave it as-is.

So my conclusion: it's still worth reviewing this PR, but it doesn't need to be too optimized.

@simrnsingh
Author

Before I can do my review, can you explicitly disclose any AI usage in this PR?

This is a requirement in our contribution guide: https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md

Hi @ngxson, I have updated the PR with AI disclosure.
Best,
Simranjeet

Commits added to the PR (first lines truncated in the timeline):

2. Use available tensor mapping logic
3. Remove redundant chat template replacement of soft-token placeholder with media placeholder

…struct and definitions to mobilenetv5.cpp
2. Remove unused `clip_is_gemma3n` func declarations and definitions
3. Remove redundant `rescale_image_u8_to_f32` func and use `normalize_image_u8_to_f32` with zero mean and unit std
4. Calculate n_patches using image_size / patch_size
@simrnsingh
Author

I’ve addressed all comments in the latest push and replied briefly inline with commit references. Requesting re-review when you have time.

@simrnsingh simrnsingh requested a review from ngxson December 21, 2025 19:41