Conversation

@infil00p (Contributor)

This branch exists to show that you can run a VLM with ONNX Runtime on Android using Rust. I leveraged pyke-ort for the ONNX Runtime bindings and relied heavily on Claude Code for the generation loop. The primary bottleneck for this project is by far the image encoder: it takes up to a minute to encode the image into tokens, and this needs to be optimized further for practical use. Image encoding is also the bottleneck when using llama.cpp, which is why I leveraged Vulkan on the Pixel 9 in that setup.

The Pixel 9 is a previous-generation phone, but it is by no means a "low-quality device", and if this can't run well on the Pixel 9, it probably can't run well on Android at all.
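
For context, this is roughly what driving the vision encoder through pyke-ort looks like. A minimal sketch, assuming the ort 2.0 release-candidate API; the model file name, input name, and shape are illustrative, not the actual values in this branch.

```rust
// Minimal sketch: load the vision encoder with pyke's `ort` crate and time a
// single encode. Model file, input name, and shape are illustrative, and the
// exact builder/macro signatures shift slightly between 2.0 release candidates.
use std::time::Instant;

use ort::session::Session;
use ort::value::Tensor;

fn time_image_encode() -> ort::Result<()> {
    let mut encoder = Session::builder()?
        .commit_from_file("vision_encoder.onnx")?; // hypothetical model file

    // Dummy [1, 3, 512, 512] input standing in for the preprocessed image.
    let pixels = Tensor::from_array(([1usize, 3, 512, 512], vec![0.0f32; 3 * 512 * 512]))?;

    let start = Instant::now();
    let _outputs = encoder.run(ort::inputs!["pixel_values" => pixels])?;
    println!("image encode took {:?}", start.elapsed()); // up to ~1 min on the Pixel 9 per this PR

    Ok(())
}
```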

infil00p and others added 5 commits June 2, 2025 08:20
- Replace the llama.cpp implementation with SmolVLM2-500M-Video-Instruct via ONNX Runtime (ort crate 2.0.0-rc.10)
- Implement a pure-Rust dylib approach with JNI bindings for Android integration (see the sketch after this commit message)
- Add SmolVLMAndroid.kt for native interface management
- Configure a dynamic linking strategy to resolve static-linking hang issues
- Update MainActivity.kt to use consistent library naming (smolvlm_snap)
- Add environment setup scripts for both static and dynamic linking configurations
- Remove the CMake configuration in favor of the cargo-ndk build system
- Successfully reduce total library size from 888MB+ to 40MB with shared libraries

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
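
To make the "pure Rust dylib with JNI bindings" approach concrete, here is a hedged sketch of one exported entry point. The package path (com.example.smolvlm) and method name (generate) are assumptions for illustration, not the actual symbols in this branch.

```rust
// Hedged JNI sketch for a pure-Rust dylib (crate-type = "cdylib") exposed to
// Kotlin. The package path `com.example.smolvlm` and the method name `generate`
// are illustrative assumptions; only the Java_<package>_<Class>_<method>
// naming pattern is the point.
use jni::objects::{JClass, JString};
use jni::sys::jstring;
use jni::JNIEnv;

#[no_mangle]
pub extern "system" fn Java_com_example_smolvlm_SmolVLMAndroid_generate(
    mut env: JNIEnv,
    _class: JClass,
    prompt: JString,
) -> jstring {
    // Copy the Java string into Rust; fall back to empty on conversion failure.
    let prompt: String = match env.get_string(&prompt) {
        Ok(s) => s.into(),
        Err(_) => String::new(),
    };

    // A real implementation would run the ONNX generation loop here.
    let reply = format!("(stub) prompt was: {prompt}");

    env.new_string(reply)
        .expect("failed to allocate JNI string")
        .into_raw()
}
```

On the Kotlin side, SmolVLMAndroid.kt would declare a matching external fun and load the library with System.loadLibrary("smolvlm_snap"), matching the library name used in MainActivity.kt.
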
Fixed the generation loop to properly handle image embeddings by:
- Expanding the prompt with <image> placeholder tokens matching the SmolVLM2 structure (4x4 grid, 64 tokens per patch)
- Tokenizing the expanded prompt with the image-token placeholders in place
- Replacing the image-token embeddings with the actual vision features from the encoder (see the sketch after this commit message)
- Using an attention mask that matches the full sequence length
- Following the working reference implementation's pattern

This resolves the repetitive output issue and allows the model to generate proper responses.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
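
The key step in that fix is the embedding splice: wherever the expanded prompt contains the <image> placeholder id, the corresponding row of the text-embedding matrix is overwritten with the next vision feature. A minimal sketch with illustrative names and row-major layouts (a 4x4 grid at 64 tokens per patch would yield 1024 placeholders):

```rust
/// Overwrite each <image> placeholder's embedding row with the next vision
/// feature. Function name, layouts, and the placeholder id are illustrative.
fn splice_image_embeddings(
    token_ids: &[i64],       // expanded prompt, including <image> placeholder ids
    text_embeds: &mut [f32], // [seq_len, hidden], row-major
    vision_feats: &[f32],    // [num_image_tokens, hidden], row-major
    image_token_id: i64,
    hidden: usize,
) {
    let mut next = 0;
    for (pos, &id) in token_ids.iter().enumerate() {
        if id == image_token_id {
            text_embeds[pos * hidden..(pos + 1) * hidden]
                .copy_from_slice(&vision_feats[next * hidden..(next + 1) * hidden]);
            next += 1;
        }
    }
    // Every vision feature should land in exactly one placeholder slot;
    // a count mismatch is what produces degenerate, repetitive output.
    debug_assert_eq!(next * hidden, vision_feats.len());
}
```
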
Changes:
- Updated ModelManager to download uint8-quantized models instead of q4 for better XNNPack compatibility
- Fixed SplashActivity download logic to only load models after ALL files have finished downloading
- Fixed the RGB channel-order conversion from Android's ARGB_8888 bitmap format, which is RGBA byte order in memory (see the sketch after this commit message)
- Added XNNPack availability checking and configuration with 4 threads
- Rebuilt ONNX Runtime with XNNPack support enabled

The VLM now generates correct output with proper colors. Performance is still limited by CPU/XNNPack on the quantized models, but the architecture is now ready for a future migration to ExecuTorch for better on-device acceleration.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
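
Two of those changes lend themselves to short sketches: registering XNNPack (assuming the ort crate's XNNPACK execution provider, gated behind its xnnpack cargo feature; exact option names vary between release candidates) and the channel-order fix, where the pitfall is that ARGB_8888 is R,G,B,A byte order in memory despite the name. The model file name and the plain /255 normalization are illustrative.

```rust
use ort::execution_providers::XNNPACKExecutionProvider;
use ort::session::Session;

/// Build a session with the XNNPack EP registered and 4 intra-op threads,
/// mirroring the commit's configuration. Requires ort's `xnnpack` feature;
/// the model file name is hypothetical.
fn build_session() -> ort::Result<Session> {
    Session::builder()?
        .with_execution_providers([XNNPACKExecutionProvider::default().build()])?
        .with_intra_threads(4)?
        .commit_from_file("decoder_model_merged_uint8.onnx")
}

/// Android's ARGB_8888 bitmaps are stored R,G,B,A per pixel in memory, so a
/// CHW float tensor must read bytes 0, 1, 2 of each pixel as R, G, B (reading
/// them as A, R, G, B is what produced the wrong colors).
fn rgba_to_chw(rgba: &[u8], width: usize, height: usize) -> Vec<f32> {
    let px = width * height;
    let mut chw = vec![0.0f32; 3 * px];
    for i in 0..px {
        chw[i] = rgba[4 * i] as f32 / 255.0;              // R plane
        chw[px + i] = rgba[4 * i + 1] as f32 / 255.0;     // G plane
        chw[2 * px + i] = rgba[4 * i + 2] as f32 / 255.0; // B plane (alpha dropped)
    }
    chw
}
```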