Conversation

@infil00p (Contributor)

This branch exists to show that you can run a VLM with ONNX Runtime on Android using Rust. I leveraged pyke-ort for the ONNX Runtime bindings and relied heavily on Claude Code for the generation loop. The primary bottleneck for this project is by far the image encoder: it takes up to a minute to encode the image into tokens, and this needs to be optimized further for practical use. Image encoding is also the bottleneck when using llama.cpp, which is why I leveraged Vulkan on the Pixel 9 in that setup.

The Pixel 9 is a previous-generation phone, but it is by no means a "low-quality device", and if this can't run well on the Pixel 9, it probably can't run well on Android at all.
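
For context, this is roughly what driving the vision encoder through pyke-ort looks like. A minimal sketch, assuming the ort 2.0 release-candidate API; the model file name, input name, and shape are illustrative, not the actual values in this branch.

```rust
// Minimal sketch: load the vision encoder with pyke's `ort` crate and time a
// single encode. Model file, input name, and shape are illustrative, and the
// exact builder/macro signatures shift slightly between 2.0 release candidates.
use std::time::Instant;

use ort::session::Session;
use ort::value::Tensor;

fn time_image_encode() -> ort::Result<()> {
    let mut encoder = Session::builder()?
        .commit_from_file("vision_encoder.onnx")?; // hypothetical model file

    // Dummy [1, 3, 512, 512] input standing in for the preprocessed image.
    let pixels = Tensor::from_array(([1usize, 3, 512, 512], vec![0.0f32; 3 * 512 * 512]))?;

    let start = Instant::now();
    let _outputs = encoder.run(ort::inputs!["pixel_values" => pixels])?;
    println!("image encode took {:?}", start.elapsed()); // up to ~1 min on the Pixel 9 per this PR

    Ok(())
}
```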

infil00p and others added 5 commits June 2, 2025 08:20
- Replace the llama.cpp implementation with SmolVLM2-500M-Video-Instruct via ONNX Runtime (ort crate 2.0.0-rc.10)
- Implement a pure-Rust dylib approach with JNI bindings for Android integration (see the sketch after this commit message)
- Add SmolVLMAndroid.kt for native interface management
- Configure a dynamic linking strategy to resolve static-linking hang issues
- Update MainActivity.kt to use consistent library naming (smolvlm_snap)
- Add environment setup scripts for both static and dynamic linking configurations
- Remove the CMake configuration in favor of the cargo-ndk build system
- Successfully reduce total library size from 888MB+ to 40MB with shared libraries

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
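
To make the "pure Rust dylib with JNI bindings" approach concrete, here is a hedged sketch of one exported entry point. The package path (com.example.smolvlm) and method name (generate) are assumptions for illustration, not the actual symbols in this branch.

```rust
// Hedged JNI sketch for a pure-Rust dylib (crate-type = "cdylib") exposed to
// Kotlin. The package path `com.example.smolvlm` and the method name `generate`
// are illustrative assumptions; only the Java_<package>_<Class>_<method>
// naming pattern is the point.
use jni::objects::{JClass, JString};
use jni::sys::jstring;
use jni::JNIEnv;

#[no_mangle]
pub extern "system" fn Java_com_example_smolvlm_SmolVLMAndroid_generate(
    mut env: JNIEnv,
    _class: JClass,
    prompt: JString,
) -> jstring {
    // Copy the Java string into Rust; fall back to empty on conversion failure.
    let prompt: String = match env.get_string(&prompt) {
        Ok(s) => s.into(),
        Err(_) => String::new(),
    };

    // A real implementation would run the ONNX generation loop here.
    let reply = format!("(stub) prompt was: {prompt}");

    env.new_string(reply)
        .expect("failed to allocate JNI string")
        .into_raw()
}
```

On the Kotlin side, SmolVLMAndroid.kt would declare a matching external fun and load the library with System.loadLibrary("smolvlm_snap"), matching the library name used in MainActivity.kt.
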
Fixed the generation loop to properly handle image embeddings by:
- Expanding the prompt with <image> placeholder tokens matching the SmolVLM2 structure (4x4 grid, 64 tokens per patch)
- Tokenizing the expanded prompt with the image-token placeholders in place
- Replacing the image-token embeddings with the actual vision features from the encoder (see the sketch after this commit message)
- Using an attention mask that matches the full sequence length
- Following the working reference implementation's pattern

This resolves the repetitive output issue and allows the model to generate proper responses.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
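
The key step in that fix is the embedding splice: wherever the expanded prompt contains the <image> placeholder id, the corresponding row of the text-embedding matrix is overwritten with the next vision feature. A minimal sketch with illustrative names and row-major layouts (a 4x4 grid at 64 tokens per patch would yield 1024 placeholders):

```rust
/// Overwrite each <image> placeholder's embedding row with the next vision
/// feature. Function name, layouts, and the placeholder id are illustrative.
fn splice_image_embeddings(
    token_ids: &[i64],       // expanded prompt, including <image> placeholder ids
    text_embeds: &mut [f32], // [seq_len, hidden], row-major
    vision_feats: &[f32],    // [num_image_tokens, hidden], row-major
    image_token_id: i64,
    hidden: usize,
) {
    let mut next = 0;
    for (pos, &id) in token_ids.iter().enumerate() {
        if id == image_token_id {
            text_embeds[pos * hidden..(pos + 1) * hidden]
                .copy_from_slice(&vision_feats[next * hidden..(next + 1) * hidden]);
            next += 1;
        }
    }
    // Every vision feature should land in exactly one placeholder slot;
    // a count mismatch is what produces degenerate, repetitive output.
    debug_assert_eq!(next * hidden, vision_feats.len());
}
```
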
Changes:
- Updated ModelManager to download uint8-quantized models instead of q4 for better XNNPack compatibility
- Fixed SplashActivity download logic to only load models after ALL files have finished downloading
- Fixed the RGB channel-order conversion from Android's ARGB_8888 bitmap format, which is RGBA byte order in memory (see the sketch after this commit message)
- Added XNNPack availability checking and configuration with 4 threads
- Rebuilt ONNX Runtime with XNNPack support enabled

The VLM now generates correct output with proper colors. Performance is still limited by CPU/XNNPack on the quantized models, but the architecture is now ready for a future migration to ExecuTorch for better on-device acceleration.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
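
Two of those changes lend themselves to short sketches: registering XNNPack (assuming the ort crate's XNNPACK execution provider, gated behind its xnnpack cargo feature; exact option names vary between release candidates) and the channel-order fix, where the pitfall is that ARGB_8888 is R,G,B,A byte order in memory despite the name. The model file name and the plain /255 normalization are illustrative.

```rust
use ort::execution_providers::XNNPACKExecutionProvider;
use ort::session::Session;

/// Build a session with the XNNPack EP registered and 4 intra-op threads,
/// mirroring the commit's configuration. Requires ort's `xnnpack` feature;
/// the model file name is hypothetical.
fn build_session() -> ort::Result<Session> {
    Session::builder()?
        .with_execution_providers([XNNPACKExecutionProvider::default().build()])?
        .with_intra_threads(4)?
        .commit_from_file("decoder_model_merged_uint8.onnx")
}

/// Android's ARGB_8888 bitmaps are stored R,G,B,A per pixel in memory, so a
/// CHW float tensor must read bytes 0, 1, 2 of each pixel as R, G, B (reading
/// them as A, R, G, B is what produced the wrong colors).
fn rgba_to_chw(rgba: &[u8], width: usize, height: usize) -> Vec<f32> {
    let px = width * height;
    let mut chw = vec![0.0f32; 3 * px];
    for i in 0..px {
        chw[i] = rgba[4 * i] as f32 / 255.0;              // R plane
        chw[px + i] = rgba[4 * i + 1] as f32 / 255.0;     // G plane
        chw[2 * px + i] = rgba[4 * i + 2] as f32 / 255.0; // B plane (alpha dropped)
    }
    chw
}
```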