# Mordal

Mordal is an automated framework for finding the best vision-language model (VLM) for a given task. It uses clustering and early stopping to reduce search time by 8.9×–11.6× compared to grid search, while achieving 69% higher weighted Kendall's τ than state-of-the-art model selection methods.
## Table of Contents

- [Installation](#installation)
- [Overview](#overview)
- [Quick Start](#quick-start)
- [Running the Full Pipeline](#running-the-full-pipeline)
- [Training a Single Model](#training-a-single-model)
- [Evaluating a Model](#evaluating-a-model)
- [Configuration](#configuration)
## Installation

### Requirements

- Python 3.8+
- CUDA-capable GPU (recommended)
- PyTorch 2.0+
### Steps

- Clone the repository:

```bash
git clone <repository-url>
cd mordal
```

- Install dependencies:

```bash
pip install torch torchvision transformers
pip install scipy scikit-learn numpy
pip install click wandb tqdm
pip install pillow
```

- Install the package in development mode:

```bash
pip install -e .
```

- Install submodules (if needed):

```bash
# The codebase includes 3rdparty/cornstarch and 3rdparty/lmms_eval
# Make sure these are properly set up
```

## Overview

Mordal automatically finds the best VLM for your task through three stages:
- **Clustering** - Groups similar vision encoders and language models using CKA metrics (see the first sketch below)
- **Early Stopping** - Uses the Successive Halving Algorithm (SHA) to eliminate poor candidates early
- **Scaling Prediction** - Predicts full-data performance using log-linear scaling laws (see the second sketch below)
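For intuition, here is a minimal NumPy sketch of linear CKA, the similarity metric behind the clustering stage. It illustrates the metric only, not the implementation in `scripts/compute_cka_vlms.py`; the feature shapes and sample data are assumptions.

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between feature matrices x (n, d1) and y (n, d2)."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature dimension
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 normalized by ||X^T X||_F * ||Y^T Y||_F
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
feats_a = rng.normal(size=(512, 768))                  # one encoder's features
feats_b = feats_a + 0.1 * rng.normal(size=(512, 768))  # a very similar encoder
print(linear_cka(feats_a, feats_b))                    # near 1.0: same cluster
```

The scaling-prediction stage can likewise be pictured as a least-squares fit in log space. The subset sizes and accuracies below are made up purely for illustration; only the functional form, accuracy ≈ a + b * log(n), reflects the description above.

```python
import numpy as np

n_samples = np.array([10_000, 20_000, 40_000, 80_000])  # training subset sizes
accuracy = np.array([0.52, 0.57, 0.61, 0.66])           # hypothetical scores

b, a = np.polyfit(np.log(n_samples), accuracy, deg=1)   # slope, then intercept
full_n = 665_000                                        # full dataset size
print(f"predicted full-data accuracy: {a + b * np.log(full_n):.3f}")
```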
## Quick Start

### 1. Compute CKA metrics

```bash
python scripts/compute_cka_vlms.py \
--vision_encoder_names siglip clip dfn5b \
--language_model_name meta-llama/Llama-2-7b-chat-hf \
--dataset_dir /path/to/dataset \
--dataset_file_name llava_v1_5_mix665k.json \
--batch_size 8 \
--device_id 0 \
--output_file metrics/projector_cka.json
```
### 2. Run the Mordal pipeline

```bash
python run_mordal.py \
--metrics_file metrics/projector_cka.json \
--data_dir /path/to/dataset \
--data_file_name llava_v1_5_mix665k.json \
--threshold_vision 0.5 \
--threshold_llm 0.5 \
--num_epoch 20 \
--batch_size 4 \
--device_id 0 \
--use_sha \
--task_name sqa \
--num_iterations 1000
```

## Running the Full Pipeline

### Basic run (no early stopping)

```bash
python run_mordal.py \
--metrics_file metrics/projector_cka.json \
--data_dir /path/to/dataset \
--data_file_name llava_v1_5_mix665k.json \
--threshold_vision 0.5 \
--threshold_llm 0.5 \
--num_epoch 20 \
--batch_size 4 \
--device_id 0
```

### With early stopping (SHA)

```bash
python run_mordal.py \
--metrics_file metrics/projector_cka.json \
--data_dir /path/to/dataset \
--data_file_name llava_v1_5_mix665k.json \
--threshold_vision 0.5 \
--threshold_llm 0.5 \
--num_epoch 20 \
--batch_size 4 \
--device_id 0 \
--use_sha \
--task_name sqa \
--num_iterations 1000 \
--sha_top_k 3
```

Key arguments:

- `--metrics_file`: Path to the CKA metrics JSON file
- `--threshold_vision` / `--threshold_llm`: Clustering thresholds (0-1)
- `--use_sha`: Enable early stopping
- `--task_name`: Task for evaluation (required with `--use_sha`)
- `--num_iterations`: Training iterations
- `--sha_top_k`: Number of top candidates to keep (see the sketch below)
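To make the early-stopping loop concrete, here is a minimal sketch of Successive Halving over candidate VLMs. The `train_steps` and `validate` methods and the budget schedule are hypothetical placeholders, not Mordal's actual API:

```python
def successive_halving(candidates, num_iterations=1000, top_k=3):
    """Train all candidates a little, keep the best half, repeat."""
    budget = num_iterations // 4              # iterations for the first round
    while len(candidates) > top_k:
        scores = {}
        for cand in candidates:
            cand.train_steps(budget)          # continue partial training
            scores[cand] = cand.validate()    # e.g., accuracy on the task
        keep = max(top_k, len(candidates) // 2)  # never drop below top_k
        candidates = sorted(candidates, key=scores.get, reverse=True)[:keep]
        budget *= 2                           # survivors earn a doubled budget
    return candidates
```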
## Training a Single Model

```bash
python -m mordal.train \
--vision_encoder_name google/siglip-so400m-patch14-384 \
--language_model_name meta-llama/Llama-2-7b-chat-hf \
--wandb_name my_experiment \
--dataset_dir /path/to/dataset \
--dataset_file_name llava_v1_5_mix665k.json \
--num_epoch 20 \
--batch_size 4 \
--device_id 0
```

To resume from a checkpoint:

```bash
python -m mordal.train \
--vision_encoder_name google/siglip-so400m-patch14-384 \
--language_model_name meta-llama/Llama-2-7b-chat-hf \
--wandb_name my_experiment \
--dataset_dir /path/to/dataset \
--dataset_file_name llava_v1_5_mix665k.json \
--num_epoch 20 \
--batch_size 4 \
--device_id 0 \
--projector_checkpoint_name /path/to/projector_checkpoint \
--lora_checkpoint_name /path/to/lora_checkpoint \
--checkpoint_iter_num 5000
```

## Evaluating a Model

```bash
python scripts/run_single_model_eval.py \
--pretrained_model /path/to/pretrained_model \
--tasks sqa seedbench \
--batch_size 1 \
--limit 500
```

Or using the evaluation module:

```bash
python -m mordal.eval \
--vision_encoder_name google/siglip-so400m-patch14-384 \
--language_model_name meta-llama/Llama-2-7b-chat-hf \
--projector_checkpoint_name /path/to/projector_checkpoint \
--lora_checkpoint_name /path/to/lora_checkpoint \
--task_name sqa
```

Common tasks: `sqa`, `seedbench`, `mmmu_val` (from lmms_eval).
## Configuration

Supported models: see `run_mordal.py` for the full list of model names.
Dataset format: LLaVA style, i.e., a JSON file with `image` and `conversations` fields, plus an image directory.
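A minimal record in this format might look like the following; the `id`, `image`, and text values are illustrative only:

```json
[
  {
    "id": "000000033471",
    "image": "coco/train2017/000000033471.jpg",
    "conversations": [
      { "from": "human", "value": "<image>\nWhat is unusual about this image?" },
      { "from": "gpt", "value": "A man is ironing clothes on the back of a moving taxi." }
    ]
  }
]
```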
## Full Example

```bash
# Compute CKA metrics
python scripts/compute_cka_vlms.py \
--vision_encoder_names siglip clip dfn5b \
--language_model_name meta-llama/Llama-2-7b-chat-hf \
--dataset_dir /data/datasets/llava-pretrain \
--dataset_file_name blip_laion_cc_sbu_558k.json \
--batch_size 8 \
--device_id 0 \
--output_file metrics/projector_cka.json

# Run pipeline
python run_mordal.py \
--metrics_file metrics/projector_cka.json \
--data_dir /data/datasets/llava-pretrain \
--data_file_name blip_laion_cc_sbu_558k.json \
--threshold_vision 0.5 \
--threshold_llm 0.5 \
--num_epoch 20 \
--batch_size 4 \
--device_id 0
```

## Citation

If you use Mordal in your research, please cite:

```bibtex
@article{he2025mordal,
title={Mordal: Automated Pretrained Model Selection for Vision Language Models},
author={He, Shiqi and Jang, Insu and Chowdhury, Mosharaf},
journal={arXiv preprint arXiv:2502.00241},
year={2025}
}
```

## License

See the LICENSE file for details.