# Mordal

Mordal is an automated framework for finding the best vision-language model (VLM) for a given task. It uses clustering and early stopping to reduce search time by 8.9×–11.6× compared to grid search, while achieving 69% higher weighted Kendall's τ than state-of-the-art model selection methods.
## Table of Contents

- [Installation](#installation)
- [Overview](#overview)
- [Quick Start](#quick-start)
- [Running the Full Pipeline](#running-the-full-pipeline)
- [Training a Single Model](#training-a-single-model)
- [Evaluating a Model](#evaluating-a-model)
- [Configuration](#configuration)
## Installation

### Requirements

- Python 3.8+
- CUDA-capable GPU (recommended)
- PyTorch 2.0+
### Steps

- Clone the repository:

```bash
git clone <repository-url>
cd mordal
```

- Install dependencies:

```bash
pip install torch torchvision transformers
pip install scipy scikit-learn numpy
pip install click wandb tqdm
pip install pillow
```

- Install the package in development mode:

```bash
pip install -e .
```

- Install submodules (if needed):

```bash
# The codebase includes 3rdparty/cornstarch and 3rdparty/lmms_eval
# Make sure these are properly set up
```

## Overview

Mordal automatically finds the best VLM for your task through three stages:
- **Clustering** - Groups similar vision encoders and language models using CKA metrics (see the first sketch below)
- **Early Stopping** - Uses the Successive Halving Algorithm (SHA) to eliminate poor candidates early
- **Scaling Prediction** - Predicts full-data performance using log-linear scaling laws (see the second sketch below)
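For intuition, here is a minimal NumPy sketch of linear CKA, the similarity metric behind the clustering stage. It illustrates the metric only, not the implementation in `scripts/compute_cka_vlms.py`; the feature shapes and sample data are assumptions.

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between feature matrices x (n, d1) and y (n, d2)."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature dimension
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 normalized by ||X^T X||_F * ||Y^T Y||_F
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
feats_a = rng.normal(size=(512, 768))                  # one encoder's features
feats_b = feats_a + 0.1 * rng.normal(size=(512, 768))  # a very similar encoder
print(linear_cka(feats_a, feats_b))                    # near 1.0: same cluster
```

The scaling-prediction stage can likewise be pictured as a least-squares fit in log space. The subset sizes and accuracies below are made up purely for illustration; only the functional form, accuracy ≈ a + b * log(n), reflects the description above.

```python
import numpy as np

n_samples = np.array([10_000, 20_000, 40_000, 80_000])  # training subset sizes
accuracy = np.array([0.52, 0.57, 0.61, 0.66])           # hypothetical scores

b, a = np.polyfit(np.log(n_samples), accuracy, deg=1)   # slope, then intercept
full_n = 665_000                                        # full dataset size
print(f"predicted full-data accuracy: {a + b * np.log(full_n):.3f}")
```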
## Quick Start

### 1. Compute CKA metrics

```bash
python scripts/compute_cka_vlms.py \
--vision_encoder_names siglip clip dfn5b \
--language_model_name meta-llama/Llama-2-7b-chat-hf \
--dataset_dir /path/to/dataset \
--dataset_file_name llava_v1_5_mix665k.json \
--batch_size 8 \
--device_id 0 \
--output_file metrics/projector_cka.json
```
### 2. Run the Mordal pipeline

```bash
python run_mordal.py \
--metrics_file metrics/projector_cka.json \
--data_dir /path/to/dataset \
--data_file_name llava_v1_5_mix665k.json \
--threshold_vision 0.5 \
--threshold_llm 0.5 \
--num_epoch 20 \
--batch_size 4 \
--device_id 0 \
--use_sha \
--task_name sqa \
--num_iterations 1000
```

## Running the Full Pipeline

### Basic run (no early stopping)

```bash
python run_mordal.py \
--metrics_file metrics/projector_cka.json \
--data_dir /path/to/dataset \
--data_file_name llava_v1_5_mix665k.json \
--threshold_vision 0.5 \
--threshold_llm 0.5 \
--num_epoch 20 \
--batch_size 4 \
--device_id 0
```

### With early stopping (SHA)

```bash
python run_mordal.py \
--metrics_file metrics/projector_cka.json \
--data_dir /path/to/dataset \
--data_file_name llava_v1_5_mix665k.json \
--threshold_vision 0.5 \
--threshold_llm 0.5 \
--num_epoch 20 \
--batch_size 4 \
--device_id 0 \
--use_sha \
--task_name sqa \
--num_iterations 1000 \
--sha_top_k 3
```

Key arguments:

- `--metrics_file`: Path to the CKA metrics JSON file
- `--threshold_vision` / `--threshold_llm`: Clustering thresholds (0-1)
- `--use_sha`: Enable early stopping
- `--task_name`: Task for evaluation (required with `--use_sha`)
- `--num_iterations`: Training iterations
- `--sha_top_k`: Number of top candidates to keep (see the sketch below)
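To make the early-stopping loop concrete, here is a minimal sketch of Successive Halving over candidate VLMs. The `train_steps` and `validate` methods and the budget schedule are hypothetical placeholders, not Mordal's actual API:

```python
def successive_halving(candidates, num_iterations=1000, top_k=3):
    """Train all candidates a little, keep the best half, repeat."""
    budget = num_iterations // 4              # iterations for the first round
    while len(candidates) > top_k:
        scores = {}
        for cand in candidates:
            cand.train_steps(budget)          # continue partial training
            scores[cand] = cand.validate()    # e.g., accuracy on the task
        keep = max(top_k, len(candidates) // 2)  # never drop below top_k
        candidates = sorted(candidates, key=scores.get, reverse=True)[:keep]
        budget *= 2                           # survivors earn a doubled budget
    return candidates
```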
## Training a Single Model

```bash
python -m mordal.train \
--vision_encoder_name google/siglip-so400m-patch14-384 \
--language_model_name meta-llama/Llama-2-7b-chat-hf \
--wandb_name my_experiment \
--dataset_dir /path/to/dataset \
--dataset_file_name llava_v1_5_mix665k.json \
--num_epoch 20 \
--batch_size 4 \
--device_id 0
```

To resume from a checkpoint:

```bash
python -m mordal.train \
--vision_encoder_name google/siglip-so400m-patch14-384 \
--language_model_name meta-llama/Llama-2-7b-chat-hf \
--wandb_name my_experiment \
--dataset_dir /path/to/dataset \
--dataset_file_name llava_v1_5_mix665k.json \
--num_epoch 20 \
--batch_size 4 \
--device_id 0 \
--projector_checkpoint_name /path/to/projector_checkpoint \
--lora_checkpoint_name /path/to/lora_checkpoint \
--checkpoint_iter_num 5000
```

## Evaluating a Model

```bash
python scripts/run_single_model_eval.py \
--pretrained_model /path/to/pretrained_model \
--tasks sqa seedbench \
--batch_size 1 \
--limit 500
```

Or using the evaluation module:

```bash
python -m mordal.eval \
--vision_encoder_name google/siglip-so400m-patch14-384 \
--language_model_name meta-llama/Llama-2-7b-chat-hf \
--projector_checkpoint_name /path/to/projector_checkpoint \
--lora_checkpoint_name /path/to/lora_checkpoint \
--task_name sqa
```

Common tasks: `sqa`, `seedbench`, `mmmu_val` (from lmms_eval).
## Configuration

Supported models: see `run_mordal.py` for the full list of model names.
Dataset format: LLaVA style, i.e., a JSON file with `image` and `conversations` fields, plus an image directory.
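A minimal record in this format might look like the following; the `id`, `image`, and text values are illustrative only:

```json
[
  {
    "id": "000000033471",
    "image": "coco/train2017/000000033471.jpg",
    "conversations": [
      { "from": "human", "value": "<image>\nWhat is unusual about this image?" },
      { "from": "gpt", "value": "A man is ironing clothes on the back of a moving taxi." }
    ]
  }
]
```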
## Full Example

```bash
# Compute CKA metrics
python scripts/compute_cka_vlms.py \
--vision_encoder_names siglip clip dfn5b \
--language_model_name meta-llama/Llama-2-7b-chat-hf \
--dataset_dir /data/datasets/llava-pretrain \
--dataset_file_name blip_laion_cc_sbu_558k.json \
--batch_size 8 \
--device_id 0 \
--output_file metrics/projector_cka.json

# Run pipeline
python run_mordal.py \
--metrics_file metrics/projector_cka.json \
--data_dir /data/datasets/llava-pretrain \
--data_file_name blip_laion_cc_sbu_558k.json \
--threshold_vision 0.5 \
--threshold_llm 0.5 \
--num_epoch 20 \
--batch_size 4 \
--device_id 0
```

## Citation

If you use Mordal in your research, please cite:

```bibtex
@article{he2025mordal,
title={Mordal: Automated Pretrained Model Selection for Vision Language Models},
author={He, Shiqi and Jang, Insu and Chowdhury, Mosharaf},
journal={arXiv preprint arXiv:2502.00241},
year={2025}
}
```

## License

See the LICENSE file for details.