
Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

arXiv Demo Project Page Models Datasets App Store

Abdelrahman Shaker1,∗,†, Ahmed Heakl1,∗, Jaseel Muhammad1,
Ritesh Thawkar1, Omkar Thawakar1, Senmao Li1, Hisham Cholakkal1,
Ian Reid1, Eric P. Xing1,2, Salman Khan1,†, Fahad Shahbaz Khan1,3,†

1Mohamed bin Zayed University of Artificial Intelligence   2Carnegie Mellon University   3Linköping University

*Equal Contributions   †Project Leaders


📣 Announcement

  • Mobile-O Live Demo: interactively explore the model's capabilities in the Mobile-O Online Demo
  • Mobile-O is now fully released! This includes models, training and evaluation code, inference scripts, paper, and the complete mobile app.

📌 Overview

Mobile-O is a compact, efficient unified vision–language–diffusion model that performs both multimodal understanding (VQA, OCR, reasoning) and image generation within a single architecture, while running entirely on-device. It is designed specifically for mobile and edge deployment, achieving real-time performance with a small memory footprint.

Mobile-O Overview

🧠 Model Capabilities

🖼️ Image Generation   •   👁️ Image Understanding   •   ✏️ Image Editing

🏗️ Architecture

Mobile-O Architecture
Overall architecture of Mobile-O: a unified vision–language–diffusion model for on-device multimodal understanding and generation.

Mobile-O consists of three main components:

  • Vision-Language Model (VLM): A compact multimodal backbone based on FastVLM, combining a FastViT-based vision encoder with a lightweight autoregressive language model (Qwen2-0.5B) for efficient visual–text understanding.

  • Diffusion Decoder: A lightweight DiT-style diffusion transformer based on SANA, paired with a VAE encoder–decoder, designed for 512×512 text-to-image generation under mobile constraints.

  • Mobile Conditioning Projector (MCP): A novel lightweight connector (~2.4M params) that bridges the VLM and diffusion decoder using layerwise feature fusion with temperature-scaled learnable weights, depthwise-separable 1D convolutions, and efficient channel attention. Unlike query-token approaches, MCP directly conditions the diffusion model on weighted VLM hidden states with minimal overhead.
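As an illustration of the MCP design, here is a minimal PyTorch sketch of its three ingredients (temperature-scaled layerwise fusion, a depthwise-separable 1D convolution, and ECA-style channel attention). All module names, dimensions, and hyperparameters below are assumptions for illustration, not the repository's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MCPSketch(nn.Module):
    """Hypothetical sketch of the Mobile Conditioning Projector described above."""

    def __init__(self, num_layers: int, dim: int, cond_dim: int, temperature: float = 0.1):
        super().__init__()
        # Learnable per-layer fusion logits, sharpened by a temperature.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.temperature = temperature
        # Depthwise-separable 1D convolution: depthwise then pointwise.
        self.dw_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.pw_conv = nn.Conv1d(dim, cond_dim, kernel_size=1)
        # ECA-style channel attention: a tiny 1D conv over the channel descriptor.
        self.channel_attn = nn.Conv1d(1, 1, kernel_size=3, padding=1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, seq, dim) — stacked VLM hidden states.
        w = F.softmax(self.layer_logits / self.temperature, dim=0)
        fused = torch.einsum("l,lbsd->bsd", w, hidden_states)  # weighted layer fusion
        x = self.pw_conv(self.dw_conv(fused.transpose(1, 2)))  # (batch, cond_dim, seq)
        # Channel attention over the mean-pooled descriptor, then rescale channels.
        a = torch.sigmoid(self.channel_attn(x.mean(dim=2, keepdim=True).transpose(1, 2)))
        x = x * a.transpose(1, 2)
        return x.transpose(1, 2)  # (batch, seq, cond_dim) conditioning for the DiT
```

The diffusion decoder would then consume the returned sequence directly as conditioning, with no intermediate query tokens.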


🎯 Supported Tasks

| Task | Input | Output | Description |
|------|-------|--------|-------------|
| 💬 Text → Text | Text | Text | General conversational AI |
| 👁️ Image → Text | Image + Text | Text | Image understanding (VQA, OCR, reasoning) |
| 🖼️ Text → Image | Text | Image | High-quality image generation at 512×512 |
| ✏️ Text + Image → Image | Image + Text | Image | Instruction-based image editing |
| 🔄 Unified Training | Mixed | Mixed | Joint image generation and understanding |

📱 Mobile App

Mobile-O runs entirely on-device with no cloud dependency. We release the full source code of the iOS app along with optimized MLX and CoreML model components. The app runs smoothly on iPhone 15 Pro, iPhone 16 Pro, and iPhone 17 Pro ✅.

Download on the App Store

  • 📱 iOS App source code: Mobile-O-App
  • 🧩 MLX & CoreML models: 🤗 HuggingFace

~3-4s Image Generation   •   👁️ ~0.4s Visual Understanding   •   💾 < 2GB Memory Footprint


📊 Training Datasets

| Stage | Description | Download |
|-------|-------------|----------|
| Pre-training | 9M text-image pairs (JourneyDB + BLIP3o-Pretrain-Short-Caption) | 🤗 HuggingFace |
| SFT | ~105K curated prompt-image pairs | 🤗 HuggingFace |
| Post-training | ~105K unified quadruplet samples | 🤗 HuggingFace |

⚙️ Setup

conda create -n mobileo python=3.12 -y
conda activate mobileo
pip install -r requirements.txt

🚀 Inference

Download Checkpoint

python -c "from huggingface_hub import snapshot_download; print(snapshot_download(repo_id='Amshaker/Mobile-O-0.5B', repo_type='model', local_dir='checkpoints', allow_patterns=['final_merged_model_23620/*']))"

1. Image Understanding

python infer_image_understanding.py \
    --model_path /HF_model/checkpoint/path/ \
    --image_path assets/cute_cat.png \
    --prompt "What is in the image?"

2. Image Generation

python infer_image_generation.py \
    --model_path /HF_model/checkpoint/path/ \
    --prompt "A vibrant tropical rainforest scene with a scarlet macaw perched on a moss-covered branch"

3. Image Editing

python infer_image_editing.py \
    --model_path /HF_model/checkpoint/path/ \
    --image_path assets/cute_cat.png \
    --prompt "Make the cat wear a hat"

🏋️ Training

Stage 1: Pretraining (Cross-Modal Alignment)

We pretrain the DiT and Mobile Conditioning Projector (MCP) components on 9M text-image pairs from JourneyDB (4M) and BLIP3o-Short-Caption (5M). The visual encoders, LLM backbone, and VAE are frozen.

bash scripts/Mobile-O-0.5B/pretrain.sh

Stage 2: Supervised Fine-tuning (SFT)

We finetune the DiT and MCP components on ~105K curated prompt-image pairs (60K from BLIP3o + 45K from ShareGPT-4o-Image). The visual encoders, LLM backbone, and VAE remain frozen.

bash scripts/Mobile-O-0.5B/sft.sh

Stage 3: Unified Multimodal Post-Training

We post-train the DiT, MCP, LLM (via LoRA), and visual encoder components on ~105K quadruplet samples in the format (generation prompt, image, question, answer). Only the VAE remains frozen.

bash scripts/Mobile-O-0.5B/post_train.sh
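Each post-training sample pairs a generation target with an understanding QA pair. A hedged sketch of the quadruplet format (field names are illustrative, not the dataset's actual schema):

```python
from dataclasses import dataclass


@dataclass
class QuadrupletSample:
    """One unified post-training sample: (generation prompt, image, question, answer).

    Field names are assumed for illustration; the released dataset may use
    different keys.
    """

    generation_prompt: str  # text prompt the model should generate the image from
    image_path: str         # target / reference image
    question: str           # understanding question about the image
    answer: str             # ground-truth answer for the understanding loss
```

A single batch of such samples lets the multi-task objective supervise image generation (prompt → image) and visual understanding (image + question → answer) jointly.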

Post-Training Pipeline
Unified multimodal post-training: jointly optimizing image generation and visual understanding via a multi-task objective.
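The three stages above differ only in which components receive gradients. That schedule can be summarized in a few lines (component names are illustrative, not the repository's identifiers):

```python
# Trainable components per training stage, as described in the three stages
# above; every component not listed stays frozen, and the VAE is always frozen.
STAGE_TRAINABLE = {
    "pretrain":   {"dit", "mcp"},
    "sft":        {"dit", "mcp"},
    "post_train": {"dit", "mcp", "llm_lora", "visual_encoder"},
}
ALWAYS_FROZEN = {"vae"}


def is_trainable(component: str, stage: str) -> bool:
    """Return True if `component` should receive gradients in `stage`."""
    return component not in ALWAYS_FROZEN and component in STAGE_TRAINABLE[stage]
```

In a training script, such a table would drive the `requires_grad` flags before each stage's optimizer is built.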

Merging LoRA Weights

Since the output of post-training is a set of LoRA adapter weights for the LLM, merge them with the base model using merge_lora.py to obtain the final merged checkpoint for inference.

python mobileo/merge_lora.py \
    --checkpoint_dir /path/to/lora_weights/ \
    --base_weights /path/to/sft_checkpoint/ \
    --output_dir /path/to/final_merged_model/

Example with actual paths:

python mobileo/merge_lora.py \
    --checkpoint_dir checkpoints/Mobile-O-0.5B-Post-Train/ \
    --base_weights checkpoints/Mobile-O-0.5B-SFT/ \
    --output_dir checkpoints/Mobile-O-0.5B-Post-Train/final_merged_model/
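Conceptually, merging folds each low-rank adapter update back into its base weight via the standard LoRA formula W' = W + (α/r)·BA. A minimal NumPy sketch of that step (not the repository's merge_lora.py):

```python
import numpy as np


def merge_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray,
               alpha: float, rank: int) -> np.ndarray:
    """Fold a low-rank LoRA update back into the base weight.

    W: (out, in) base weight; A: (rank, in) and B: (out, rank) adapter
    factors; alpha/rank is the usual LoRA scaling.
    """
    return W + (alpha / rank) * (B @ A)
```

After this fold, the adapter matrices can be discarded and inference runs on a single dense weight per layer, which is what the merged checkpoint produced above contains.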

🎨 Qualitative Results

Generation Samples:

Generation Results

Qualitative Comparison:

Qualitative Results

More Generation Comparison:

Generation Comparison

More Understanding Comparison:

Understanding Comparison


📄 Citation

If you find Mobile-O useful in your research, please consider citing:

@article{shaker2026mobileo,
  title={Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device},
  author={Shaker, Abdelrahman and Heakl, Ahmed and Muhammad, Jaseel and Thawkar, Ritesh and Thawakar, Omkar and Li, Senmao and Cholakkal, Hisham and Reid, Ian and Xing, Eric P. and Khan, Salman and Khan, Fahad Shahbaz},
  journal={arXiv preprint arXiv:2602.20161},
  year={2026}
}

🙏 Acknowledgements

This repo is partially built upon BLIP3o. Thanks to all the contributors for their great efforts.


📜 License

  • The Mobile-O models, source code, and mobile application are released exclusively for research and non-commercial use under the CC BY-NC-SA 4.0 license. Any commercial use is strictly prohibited without prior explicit written permission from the authors.