Abdelrahman Shaker1,∗,†, Ahmed Heakl1,∗, Jaseel Muhammad1,
Ritesh Thawkar1, Omkar Thawakar1, Senmao Li1, Hisham Cholakkal1,
Ian Reid1, Eric P. Xing1,2, Salman Khan1,†, Fahad Shahbaz Khan1,3,†
1Mohamed bin Zayed University of Artificial Intelligence 2Carnegie Mellon University 3Linköping University
∗Equal Contribution †Project Leaders
- Mobile-O Live Demo: Interactively explore the model’s capabilities
- Mobile-O is now fully released! This includes models, training and evaluation code, inference scripts, paper, and the complete mobile app.
Mobile-O is a compact, efficient unified vision–language–diffusion model that performs both multimodal understanding (VQA, OCR, reasoning) and image generation within a single architecture, while running entirely on-device. It is designed specifically for mobile and edge deployment, achieving real-time performance with a small memory footprint.
| 🖼️ Image Generation | 👁️ Image Understanding | ✏️ Image Editing |
|---|---|---|
| ![]() | ![]() | ![]() |
Overall architecture of Mobile-O: a unified vision–language–diffusion model for on-device multimodal understanding and generation.
Mobile-O consists of three main components:
- Vision-Language Model (VLM): A compact multimodal backbone based on FastVLM, combining a FastViT-based vision encoder with a lightweight autoregressive language model (Qwen2-0.5B) for efficient visual–text understanding.
- Diffusion Decoder: A lightweight DiT-style diffusion transformer based on SANA, paired with a VAE encoder–decoder, designed for 512×512 text-to-image generation under mobile constraints.
- Mobile Conditioning Projector (MCP): A novel lightweight connector (~2.4M params) that bridges the VLM and diffusion decoder using layerwise feature fusion with temperature-scaled learnable weights, depthwise-separable 1D convolutions, and efficient channel attention. Unlike query-token approaches, MCP directly conditions the diffusion model on weighted VLM hidden states with minimal overhead.
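The temperature-scaled layerwise fusion at the heart of the MCP can be sketched as follows. This is an illustrative numpy sketch, not the released implementation: the names (`layerwise_fusion`, `layer_logits`) are our own, and the depthwise-separable convolutions and channel attention that follow the fusion are omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def layerwise_fusion(hidden_states, layer_logits, temperature=2.0):
    """Fuse per-layer VLM hidden states with temperature-scaled
    learnable weights (illustrative sketch only).

    hidden_states: (num_layers, seq_len, dim) stack of VLM hidden states
    layer_logits:  (num_layers,) learnable scalars, one per layer
    """
    weights = softmax(layer_logits / temperature)        # (num_layers,)
    # weighted sum over the layer axis -> (seq_len, dim)
    return np.tensordot(weights, hidden_states, axes=1)

# toy example: 4 layers, 8 tokens, 16-dim features
hs = np.random.randn(4, 8, 16)
logits = np.zeros(4)  # equal weights before any training
fused = layerwise_fusion(hs, logits)
print(fused.shape)  # (8, 16)
```

With all logits at zero the softmax is uniform, so the fusion reduces to a plain mean over layers; training moves the logits so the diffusion decoder can emphasize the most informative VLM layers.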
| Task | Input | Output | Description |
|---|---|---|---|
| 💬 Text → Text | Text | Text | General conversational AI |
| 👁️ Image → Text | Image + Text | Text | Image understanding (VQA, OCR, reasoning) |
| 🖼️ Text → Image | Text | Image | High-quality image generation at 512×512 |
| ✏️ Text + Image → Image | Image + Text | Image | Instruction-based image editing |
| 🔄 Unified Training | Mixed | Mixed | Joint image generation and understanding |
Mobile-O runs entirely on-device with no cloud dependency. We release the full source code of the iOS app along with optimized MLX and CoreML model components. The app runs smoothly on iPhone 15 Pro, iPhone 16 Pro, and iPhone 17 Pro ✅.
| Resource | Link |
|---|---|
| 📱 iOS App Source Code | Mobile-O-App |
| 🧩 MLX & CoreML Models | 🤗 HuggingFace |
⚡ ~3-4s Image Generation • 👁️ ~0.4s Visual Understanding • 💾 < 2GB Memory Footprint
| Stage | Description | Download |
|---|---|---|
| Pre-training | 9M text-image pairs (JourneyDB + BLIP3o-Pretrain-Short-Caption) | 🤗 HuggingFace |
| SFT | ~105K curated prompt-image pairs | 🤗 HuggingFace |
| Post-training | ~105K unified quadruplet samples | 🤗 HuggingFace |
```bash
conda create -n mobileo python=3.12 -y
conda activate mobileo
pip install -r requirements.txt
```

Download the pre-trained checkpoint:

```bash
python -c "from huggingface_hub import snapshot_download; print(snapshot_download(repo_id='Amshaker/Mobile-O-0.5B', repo_type='model', local_dir='checkpoints', allow_patterns=['final_merged_model_23620/*']))"
```

Image understanding:

```bash
python infer_image_understanding.py \
    --model_path /HF_model/checkpoint/path/ \
    --image_path assets/cute_cat.png \
    --prompt "What is in the image?"
```

Image generation:

```bash
python infer_image_generation.py \
    --model_path /HF_model/checkpoint/path/ \
    --prompt "A vibrant tropical rainforest scene with a scarlet macaw perched on a moss-covered branch"
```

Image editing:

```bash
python infer_image_editing.py \
    --model_path /HF_model/checkpoint/path/ \
    --image_path assets/cute_cat.png \
    --prompt "Make the cat wear a hat"
```

We pretrain the DiT and Mobile Conditioning Projector (MCP) components on 9M text-image pairs from JourneyDB (4M) and BLIP3o-Short-Caption (5M), using the pre-training data linked above. The visual encoders, LLM backbone, and VAE are frozen.
```bash
bash scripts/Mobile-O-0.5B/pretrain.sh
```

We finetune the DiT and MCP components on ~105K curated prompt-image pairs (60K from BLIP3o + 45K from ShareGPT-4o-Image), using the SFT data linked above. The visual encoders, LLM backbone, and VAE remain frozen.
```bash
bash scripts/Mobile-O-0.5B/sft.sh
```

We post-train the DiT, MCP, LLM (via LoRA), and visual encoder components on ~105K quadruplet samples in the format (generation prompt, image, question, answer), using the post-training data linked above. Only the VAE remains frozen.
```bash
bash scripts/Mobile-O-0.5B/post_train.sh
```
Unified multimodal post-training: jointly optimizing image generation and visual understanding via a multi-task objective.
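The exact loss weighting is defined in the released training scripts; as a hedged illustration, a multi-task objective of this kind combines a diffusion (noise-prediction) loss with a next-token cross-entropy loss. All names and weights below (`multitask_loss`, `w_gen`, `w_und`) are hypothetical, not Mobile-O's actual values.

```python
import numpy as np

def diffusion_loss(pred_noise, true_noise):
    # standard epsilon-prediction MSE used by DiT-style diffusion models
    return np.mean((pred_noise - true_noise) ** 2)

def understanding_loss(logits, target_ids):
    # next-token cross-entropy over the answer tokens
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(target_ids)), target_ids])

def multitask_loss(pred_noise, true_noise, logits, target_ids,
                   w_gen=1.0, w_und=1.0):
    # weighted sum of the generation and understanding objectives;
    # the weights here are illustrative placeholders
    return (w_gen * diffusion_loss(pred_noise, true_noise)
            + w_und * understanding_loss(logits, target_ids))
```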
Since the output of post-training is LoRA adapter weights for the LLM, you can merge them with the base model using merge_lora.py to get the final merged checkpoint for inference.
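Conceptually, merging folds each adapter update back into its base weight as W' = W + (α/r)·B·A, after which the adapter can be discarded. A minimal numpy sketch (`merge_lora_weight` is a hypothetical helper, not part of merge_lora.py):

```python
import numpy as np

def merge_lora_weight(W, A, B, alpha, r):
    """Fold a LoRA adapter into its base weight (illustrative sketch).

    W: base weight,           shape (d_out, d_in)
    A: LoRA down-projection,  shape (r, d_in)
    B: LoRA up-projection,    shape (d_out, r)
    """
    return W + (alpha / r) * (B @ A)

# toy example with rank-2 adapters
d_out, d_in, r, alpha = 6, 4, 2, 16
W = np.random.randn(d_out, d_in)
A = np.random.randn(r, d_in)
B = np.random.randn(d_out, r)
W_merged = merge_lora_weight(W, A, B, alpha, r)
```

After merging, inference needs no LoRA-aware code path, which is why the app ships only the merged checkpoint.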
```bash
python mobileo/merge_lora.py \
    --checkpoint_dir /path/to/lora_weights/ \
    --base_weights /path/to/sft_checkpoint/ \
    --output_dir /path/to/final_merged_model/
```

Example with actual paths:
```bash
python mobileo/merge_lora.py \
    --checkpoint_dir checkpoints/Mobile-O-0.5B-Post-Train/ \
    --base_weights checkpoints/Mobile-O-0.5B-SFT/ \
    --output_dir checkpoints/Mobile-O-0.5B-Post-Train/final_merged_model/
```

If you find Mobile-O useful in your research, please consider citing:
```bibtex
@article{shaker2026mobileo,
  title={Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device},
  author={Shaker, Abdelrahman and Heakl, Ahmed and Muhammad, Jaseel and Thawkar, Ritesh and Thawakar, Omkar and Li, Senmao and Cholakkal, Hisham and Reid, Ian and Xing, Eric P. and Khan, Salman and Khan, Fahad Shahbaz},
  journal={arXiv preprint arXiv:2602.20161},
  year={2026}
}
```

This repo is partially built upon BLIP3o. Thanks to all the contributors for their great efforts.
- The Mobile-O models, source code, and mobile application are released exclusively for research and non-commercial use under the CC BY-NC-SA 4.0 license. Any commercial use is strictly prohibited without prior explicit written permission from the authors.







