
Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

arXiv Demo Project Page Models Datasets App Store

Abdelrahman Shaker1,∗,†, Ahmed Heakl1,∗, Jaseel Muhammad1,
Ritesh Thawkar1, Omkar Thawakar1, Senmao Li1, Hisham Cholakkal1,
Ian Reid1, Eric P. Xing1,2, Salman Khan1,†, Fahad Shahbaz Khan1,3,†

1Mohamed bin Zayed University of Artificial Intelligence   2Carnegie Mellon University   3Linköping University

*Equal Contributions   †Project Leaders


📣 Announcement

  • Mobile-O Live Demo: interactively explore the model's capabilities in the Mobile-O Online Demo
  • Mobile-O is now fully released! This includes models, training and evaluation code, inference scripts, paper, and the complete mobile app.

📌 Overview

Mobile-O is a compact, efficient unified vision–language–diffusion model that performs both multimodal understanding (VQA, OCR, reasoning) and image generation within a single architecture, while running entirely on-device. It is designed specifically for mobile and edge deployment, achieving real-time performance with a small memory footprint.

Mobile-O Overview

🧠 Model Capabilities

🖼️ Image Generation   •   👁️ Image Understanding   •   ✏️ Image Editing

🏗️ Architecture

Mobile-O Architecture
Overall architecture of Mobile-O: a unified vision–language–diffusion model for on-device multimodal understanding and generation.

Mobile-O consists of three main components:

  • Vision-Language Model (VLM): A compact multimodal backbone based on FastVLM, combining a FastViT-based vision encoder with a lightweight autoregressive language model (Qwen2-0.5B) for efficient visual–text understanding.

  • Diffusion Decoder: A lightweight DiT-style diffusion transformer based on SANA, paired with a VAE encoder–decoder, designed for 512×512 text-to-image generation under mobile constraints.

  • Mobile Conditioning Projector (MCP): A novel lightweight connector (~2.4M params) that bridges the VLM and diffusion decoder using layerwise feature fusion with temperature-scaled learnable weights, depthwise-separable 1D convolutions, and efficient channel attention. Unlike query-token approaches, MCP directly conditions the diffusion model on weighted VLM hidden states with minimal overhead.
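As an illustration of the MCP design, here is a minimal PyTorch sketch of its three ingredients (temperature-scaled layerwise fusion, a depthwise-separable 1D convolution, and ECA-style channel attention). All module names, dimensions, and hyperparameters below are assumptions for illustration, not the repository's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MCPSketch(nn.Module):
    """Hypothetical sketch of the Mobile Conditioning Projector described above."""

    def __init__(self, num_layers: int, dim: int, cond_dim: int, temperature: float = 0.1):
        super().__init__()
        # Learnable per-layer fusion logits, sharpened by a temperature.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.temperature = temperature
        # Depthwise-separable 1D convolution: depthwise then pointwise.
        self.dw_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.pw_conv = nn.Conv1d(dim, cond_dim, kernel_size=1)
        # ECA-style channel attention: a tiny 1D conv over the channel descriptor.
        self.channel_attn = nn.Conv1d(1, 1, kernel_size=3, padding=1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, seq, dim) — stacked VLM hidden states.
        w = F.softmax(self.layer_logits / self.temperature, dim=0)
        fused = torch.einsum("l,lbsd->bsd", w, hidden_states)  # weighted layer fusion
        x = self.pw_conv(self.dw_conv(fused.transpose(1, 2)))  # (batch, cond_dim, seq)
        # Channel attention over the mean-pooled descriptor, then rescale channels.
        a = torch.sigmoid(self.channel_attn(x.mean(dim=2, keepdim=True).transpose(1, 2)))
        x = x * a.transpose(1, 2)
        return x.transpose(1, 2)  # (batch, seq, cond_dim) conditioning for the DiT
```

The diffusion decoder would then consume the returned sequence directly as conditioning, with no intermediate query tokens.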


🎯 Supported Tasks

| Task | Input | Output | Description |
|------|-------|--------|-------------|
| 💬 Text → Text | Text | Text | General conversational AI |
| 👁️ Image → Text | Image + Text | Text | Image understanding (VQA, OCR, reasoning) |
| 🖼️ Text → Image | Text | Image | High-quality image generation at 512×512 |
| ✏️ Text + Image → Image | Image + Text | Image | Instruction-based image editing |
| 🔄 Unified Training | Mixed | Mixed | Joint image generation and understanding |

📱 Mobile App

Mobile-O runs entirely on-device with no cloud dependency. We release the full source code of the iOS app along with optimized MLX and CoreML model components. The app runs smoothly on iPhone 15 Pro, iPhone 16 Pro, and iPhone 17 Pro ✅.

Download on the App Store

  • 📱 iOS App source code: Mobile-O-App
  • 🧩 MLX & CoreML models: 🤗 HuggingFace

~3-4s Image Generation   •   👁️ ~0.4s Visual Understanding   •   💾 < 2GB Memory Footprint


📊 Training Datasets

| Stage | Description | Download |
|-------|-------------|----------|
| Pre-training | 9M text-image pairs (JourneyDB + BLIP3o-Pretrain-Short-Caption) | 🤗 HuggingFace |
| SFT | ~105K curated prompt-image pairs | 🤗 HuggingFace |
| Post-training | ~105K unified quadruplet samples | 🤗 HuggingFace |

⚙️ Setup

conda create -n mobileo python=3.12 -y
conda activate mobileo
pip install -r requirements.txt

🚀 Inference

Download Checkpoint

python -c "from huggingface_hub import snapshot_download; print(snapshot_download(repo_id='Amshaker/Mobile-O-0.5B', repo_type='model', local_dir='checkpoints', allow_patterns=['final_merged_model_23620/*']))"

1. Image Understanding

python infer_image_understanding.py \
    --model_path /HF_model/checkpoint/path/ \
    --image_path assets/cute_cat.png \
    --prompt "What is in the image?"

2. Image Generation

python infer_image_generation.py \
    --model_path /HF_model/checkpoint/path/ \
    --prompt "A vibrant tropical rainforest scene with a scarlet macaw perched on a moss-covered branch"

3. Image Editing

python infer_image_editing.py \
    --model_path /HF_model/checkpoint/path/ \
    --image_path assets/cute_cat.png \
    --prompt "Make the cat wear a hat"

🏋️ Training

Stage 1: Pretraining (Cross-Modal Alignment)

We pretrain the DiT and Mobile Conditioning Projector (MCP) components on 9M text-image pairs from JourneyDB (4M) and BLIP3o-Short-Caption (5M). The visual encoders, LLM backbone, and VAE are frozen.

bash scripts/Mobile-O-0.5B/pretrain.sh

Stage 2: Supervised Fine-tuning (SFT)

We finetune the DiT and MCP components on ~105K curated prompt-image pairs (60K from BLIP3o + 45K from ShareGPT-4o-Image). The visual encoders, LLM backbone, and VAE remain frozen.

bash scripts/Mobile-O-0.5B/sft.sh

Stage 3: Unified Multimodal Post-Training

We post-train the DiT, MCP, LLM (via LoRA), and visual encoder components on ~105K quadruplet samples in the format (generation prompt, image, question, answer). Only the VAE remains frozen.

bash scripts/Mobile-O-0.5B/post_train.sh
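Each post-training sample pairs a generation target with an understanding QA pair. A hedged sketch of the quadruplet format (field names are illustrative, not the dataset's actual schema):

```python
from dataclasses import dataclass


@dataclass
class QuadrupletSample:
    """One unified post-training sample: (generation prompt, image, question, answer).

    Field names are assumed for illustration; the released dataset may use
    different keys.
    """

    generation_prompt: str  # text prompt the model should generate the image from
    image_path: str         # target / reference image
    question: str           # understanding question about the image
    answer: str             # ground-truth answer for the understanding loss
```

A single batch of such samples lets the multi-task objective supervise image generation (prompt → image) and visual understanding (image + question → answer) jointly.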

Post-Training Pipeline
Unified multimodal post-training: jointly optimizing image generation and visual understanding via a multi-task objective.
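The three stages above differ only in which components receive gradients. That schedule can be summarized in a few lines (component names are illustrative, not the repository's identifiers):

```python
# Trainable components per training stage, as described in the three stages
# above; every component not listed stays frozen, and the VAE is always frozen.
STAGE_TRAINABLE = {
    "pretrain":   {"dit", "mcp"},
    "sft":        {"dit", "mcp"},
    "post_train": {"dit", "mcp", "llm_lora", "visual_encoder"},
}
ALWAYS_FROZEN = {"vae"}


def is_trainable(component: str, stage: str) -> bool:
    """Return True if `component` should receive gradients in `stage`."""
    return component not in ALWAYS_FROZEN and component in STAGE_TRAINABLE[stage]
```

In a training script, such a table would drive the `requires_grad` flags before each stage's optimizer is built.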

Merging LoRA Weights

Since the output of post-training is a set of LoRA adapter weights for the LLM, merge them with the base model using merge_lora.py to obtain the final merged checkpoint for inference.

python mobileo/merge_lora.py \
    --checkpoint_dir /path/to/lora_weights/ \
    --base_weights /path/to/sft_checkpoint/ \
    --output_dir /path/to/final_merged_model/

Example with actual paths:

python mobileo/merge_lora.py \
    --checkpoint_dir checkpoints/Mobile-O-0.5B-Post-Train/ \
    --base_weights checkpoints/Mobile-O-0.5B-SFT/ \
    --output_dir checkpoints/Mobile-O-0.5B-Post-Train/final_merged_model/
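Conceptually, merging folds each low-rank adapter update back into its base weight via the standard LoRA formula W' = W + (α/r)·BA. A minimal NumPy sketch of that step (not the repository's merge_lora.py):

```python
import numpy as np


def merge_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray,
               alpha: float, rank: int) -> np.ndarray:
    """Fold a low-rank LoRA update back into the base weight.

    W: (out, in) base weight; A: (rank, in) and B: (out, rank) adapter
    factors; alpha/rank is the usual LoRA scaling.
    """
    return W + (alpha / rank) * (B @ A)
```

After this fold, the adapter matrices can be discarded and inference runs on a single dense weight per layer, which is what the merged checkpoint produced above contains.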

🎨 Qualitative Results

Generation Samples:

Generation Results

Qualitative Comparison:

Qualitative Results

More Generation Comparison:

Generation Comparison

More Understanding Comparison:

Understanding Comparison


📄 Citation

If you find Mobile-O useful in your research, please consider citing:

@article{shaker2026mobileo,
  title={Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device},
  author={Shaker, Abdelrahman and Heakl, Ahmed and Muhammad, Jaseel and Thawkar, Ritesh and Thawakar, Omkar and Li, Senmao and Cholakkal, Hisham and Reid, Ian and Xing, Eric P. and Khan, Salman and Khan, Fahad Shahbaz},
  journal={arXiv preprint arXiv:2602.20161},
  year={2026}
}

🙏 Acknowledgements

This repo is partially built upon BLIP3o. Thanks to all the contributors for their great efforts.


📜 License

  • The Mobile-O models, source code, and mobile application are released exclusively for research and non-commercial use under the CC BY-NC-SA 4.0 license. Any commercial use is strictly prohibited without prior explicit written permission from the authors.