Mordal: Automated Pretrained Model Selection for Vision Language Models

Mordal is an automated framework for finding the best vision-language model (VLM) for a given task. It uses clustering and early stopping to reduce search time by 8.9×–11.6× compared to grid search, while achieving 69% higher weighted Kendall's τ than state-of-the-art model selection methods.
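
Weighted Kendall's τ measures how closely a predicted ranking of candidate models tracks their true ranking, with extra weight on the top of the list. SciPy (installed in Setup below) computes it directly; the scores in this sketch are illustrative, not results from the paper:

from scipy.stats import weightedtau

# True task scores vs. scores a selection method predicted for
# five candidate VLMs (illustrative numbers only).
true_scores = [0.71, 0.65, 0.62, 0.58, 0.50]
pred_scores = [0.69, 0.66, 0.59, 0.60, 0.48]

tau, _ = weightedtau(true_scores, pred_scores)
print(f"weighted Kendall's tau: {tau:.3f}")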

Table of Contents

  • Installation
  • Overview
  • Quick Start
  • Running the Full Pipeline
  • Training a Single Model
  • Evaluating a Model
  • Supported Models
  • Dataset Format
  • Examples
  • Citation
  • License

Installation

Prerequisites

  • Python 3.8+
  • CUDA-capable GPU (recommended)
  • PyTorch 2.0+

Setup

  1. Clone the repository:
git clone https://github.com/SymbioticLab/Mordal.git
cd Mordal
  2. Install dependencies:
pip install torch torchvision transformers
pip install scipy scikit-learn numpy
pip install click wandb tqdm
pip install pillow
  3. Install the package in development mode:
pip install -e .
  4. Install submodules (if needed):
# The codebase includes 3rdparty/cornstarch and 3rdparty/lmms_eval.
# If they were not fetched when cloning, initialize them with:
git submodule update --init --recursive

Overview

Mordal automatically finds the best VLM for your task through three stages:

  1. Clustering - Groups similar vision encoders and language models using Centered Kernel Alignment (CKA) metrics
  2. Early Stopping - Uses the Successive Halving Algorithm (SHA) to eliminate poor candidates early
  3. Scaling Prediction - Predicts full-data performance using log-linear scaling laws (sketched below)
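
As a rough illustration of the scaling-prediction stage, the sketch below fits a log-linear law, accuracy ≈ a·log(n) + b, to a few small-scale runs and extrapolates to the full dataset. All numbers are made up for illustration:

import numpy as np

# Observed (training samples, accuracy) pairs for one candidate VLM.
n = np.array([1_000, 2_000, 4_000, 8_000])
acc = np.array([0.52, 0.57, 0.61, 0.66])

# Fit acc ≈ a * log(n) + b; np.polyfit returns [slope, intercept].
a, b = np.polyfit(np.log(n), acc, deg=1)

# Extrapolate to the full dataset size (e.g., the 665k-sample LLaVA mix).
predicted = a * np.log(665_000) + b
print(f"predicted full-data accuracy: {predicted:.3f}")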

Quick Start

Step 1: Compute CKA Metrics

python scripts/compute_cka_vlms.py \
    --vision_encoder_names siglip clip dfn5b \
    --language_model_name meta-llama/Llama-2-7b-chat-hf \
    --dataset_dir /path/to/dataset \
    --dataset_file_name llava_v1_5_mix665k.json \
    --batch_size 8 \
    --device_id 0 \
    --output_file metrics/projector_cka.json
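
For intuition, linear CKA scores how similarly two models represent the same inputs. The self-contained sketch below computes that score from two feature matrices; the script's exact computation may differ in detail:

import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two feature matrices
    of shape (n_samples, dim). Returns a similarity score in [0, 1]."""
    X = X - X.mean(axis=0)  # center each feature dimension
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

# Toy example: features from two encoders on the same 128 images.
rng = np.random.default_rng(0)
feats_a = rng.standard_normal((128, 768))
feats_b = feats_a @ rng.standard_normal((768, 512))  # linearly related
feats_c = rng.standard_normal((128, 512))            # unrelated
print(linear_cka(feats_a, feats_b))  # high similarity
print(linear_cka(feats_a, feats_c))  # near zero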

Step 2: Run Mordal Pipeline

python run_mordal.py \
    --metrics_file metrics/projector_cka.json \
    --data_dir /path/to/dataset \
    --data_file_name llava_v1_5_mix665k.json \
    --threshold_vision 0.5 \
    --threshold_llm 0.5 \
    --num_epoch 20 \
    --batch_size 4 \
    --device_id 0 \
    --use_sha \
    --task_name sqa \
    --num_iterations 1000

Running the Full Pipeline

Basic Usage

python run_mordal.py \
    --metrics_file metrics/projector_cka.json \
    --data_dir /path/to/dataset \
    --data_file_name llava_v1_5_mix665k.json \
    --threshold_vision 0.5 \
    --threshold_llm 0.5 \
    --num_epoch 20 \
    --batch_size 4 \
    --device_id 0
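
The --threshold_vision and --threshold_llm values (between 0 and 1) control how aggressively models with similar representations are merged into one cluster. The sketch below is a simplified stand-in for that step, grouping models whose pairwise CKA with a cluster representative exceeds the threshold; the names and matrix are illustrative:

import numpy as np

def cluster_by_cka(names, cka, threshold):
    """Greedy grouping: a model joins the first cluster whose
    representative it resembles (CKA >= threshold), otherwise it
    starts a new cluster. Simplified stand-in, not Mordal's exact method."""
    clusters = []  # list of (representative_index, member_indices)
    for i in range(len(names)):
        for rep, members in clusters:
            if cka[i, rep] >= threshold:
                members.append(i)
                break
        else:
            clusters.append((i, [i]))
    return [[names[j] for j in members] for _, members in clusters]

# Toy 3x3 pairwise CKA matrix for three vision encoders.
names = ["siglip", "clip", "dfn5b"]
cka = np.array([[1.0, 0.4, 0.8],
                [0.4, 1.0, 0.3],
                [0.8, 0.3, 1.0]])
print(cluster_by_cka(names, cka, threshold=0.5))
# [['siglip', 'dfn5b'], ['clip']]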

With Early Stopping (SHA)

python run_mordal.py \
    --metrics_file metrics/projector_cka.json \
    --data_dir /path/to/dataset \
    --data_file_name llava_v1_5_mix665k.json \
    --threshold_vision 0.5 \
    --threshold_llm 0.5 \
    --num_epoch 20 \
    --batch_size 4 \
    --device_id 0 \
    --use_sha \
    --task_name sqa \
    --num_iterations 1000 \
    --sha_top_k 3

Key Parameters

  • --metrics_file: Path to CKA metrics JSON file
  • --threshold_vision / --threshold_llm: Clustering thresholds (0-1)
  • --use_sha: Enable early stopping with Successive Halving (see the sketch after this list)
  • --task_name: Task for evaluation (required with --use_sha)
  • --num_iterations: Training iterations
  • --sha_top_k: Number of top candidates to keep
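
A minimal sketch of the Successive Halving schedule these flags control: every candidate trains for a slice of the budget, the weaker half is dropped, and the process repeats until --sha_top_k candidates remain. train_step and evaluate are hypothetical stand-ins, not Mordal's actual API:

import math

def successive_halving(candidates, train_step, evaluate, total_iters, top_k=1):
    """Train all candidates briefly, keep the better half, repeat.
    train_step(model, iters) and evaluate(model) are hypothetical
    helpers; evaluate returns a score where higher is better."""
    rounds = max(1, math.ceil(math.log2(len(candidates))))
    iters_per_round = max(1, total_iters // rounds)
    while len(candidates) > top_k:
        for model in candidates:
            train_step(model, iters_per_round)
        candidates = sorted(candidates, key=evaluate, reverse=True)
        candidates = candidates[: max(top_k, len(candidates) // 2)]
    return candidates

With --sha_top_k 3, for example, elimination stops once three candidates survive, and only those continue training to completion.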

Training a Single Model

python -m mordal.train \
    --vision_encoder_name google/siglip-so400m-patch14-384 \
    --language_model_name meta-llama/Llama-2-7b-chat-hf \
    --wandb_name my_experiment \
    --dataset_dir /path/to/dataset \
    --dataset_file_name llava_v1_5_mix665k.json \
    --num_epoch 20 \
    --batch_size 4 \
    --device_id 0

Resume from Checkpoint

python -m mordal.train \
    --vision_encoder_name google/siglip-so400m-patch14-384 \
    --language_model_name meta-llama/Llama-2-7b-chat-hf \
    --wandb_name my_experiment \
    --dataset_dir /path/to/dataset \
    --dataset_file_name llava_v1_5_mix665k.json \
    --num_epoch 20 \
    --batch_size 4 \
    --device_id 0 \
    --projector_checkpoint_name /path/to/projector_checkpoint \
    --lora_checkpoint_name /path/to/lora_checkpoint \
    --checkpoint_iter_num 5000

Evaluating a Model

python scripts/run_single_model_eval.py \
    --pretrained_model /path/to/pretrained_model \
    --tasks sqa seedbench \
    --batch_size 1 \
    --limit 500

Or using the evaluation module:

python -m mordal.eval \
    --vision_encoder_name google/siglip-so400m-patch14-384 \
    --language_model_name meta-llama/Llama-2-7b-chat-hf \
    --projector_checkpoint_name /path/to/projector_checkpoint \
    --lora_checkpoint_name /path/to/lora_checkpoint \
    --task_name sqa

Common tasks: sqa, seedbench, mmmu_val (from lmms_eval)

Supported Models

See run_mordal.py for the full list of supported model names.

Dataset Format

LLaVA format: a JSON file whose records have image and conversations fields, plus a directory containing the referenced images.
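
A quick sanity check for such a file, assuming the standard LLaVA layout (the paths below are placeholders):

import json, os

data_dir = "/path/to/dataset"  # placeholder
with open(os.path.join(data_dir, "llava_v1_5_mix665k.json")) as f:
    records = json.load(f)

# Image-grounded records name an image relative to data_dir and carry
# alternating human/gpt conversation turns; some mixes also include
# text-only records without an "image" field.
sample = records[0]
assert "conversations" in sample
if "image" in sample:
    assert os.path.exists(os.path.join(data_dir, sample["image"]))
print(sample["conversations"][0]["from"])  # typically "human"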

Examples

Basic Pipeline

# Compute CKA metrics
python scripts/compute_cka_vlms.py \
    --vision_encoder_names siglip clip dfn5b \
    --language_model_name meta-llama/Llama-2-7b-chat-hf \
    --dataset_dir /data/datasets/llava-pretrain \
    --dataset_file_name blip_laion_cc_sbu_558k.json \
    --batch_size 8 \
    --device_id 0 \
    --output_file metrics/projector_cka.json

# Run pipeline
python run_mordal.py \
    --metrics_file metrics/projector_cka.json \
    --data_dir /data/datasets/llava-pretrain \
    --data_file_name blip_laion_cc_sbu_558k.json \
    --threshold_vision 0.5 \
    --threshold_llm 0.5 \
    --num_epoch 20 \
    --batch_size 4 \
    --device_id 0

Citation

If you use Mordal in your research, please cite:

@article{he2025mordal,
  title={Mordal: Automated Pretrained Model Selection for Vision Language Models},
  author={He, Shiqi and Jang, Insu and Chowdhury, Mosharaf},
  journal={arXiv preprint arXiv:2502.00241},
  year={2025}
}

License

See LICENSE file for details.
