llm-jp-eval-mm is a lightweight framework for evaluating visual-language models across various benchmark tasks, mainly focusing on Japanese tasks.
You can install llm-jp-eval-mm from GitHub or via PyPI.
- Option 1: Clone from GitHub (Recommended)
```bash
git clone git@github.com:llm-jp/llm-jp-eval-mm.git
cd llm-jp-eval-mm
uv sync
```
- Option 2: Install via PyPI
```bash
pip install eval_mm
```
To use LLM-as-a-Judge, configure your OpenAI API keys in a `.env` file:
- For Azure: set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY`
- For OpenAI: set `OPENAI_API_KEY`
If you are not using LLM-as-a-Judge, you can assign any value to these variables in the `.env` file to bypass the error.
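For example, a minimal `.env` might look like the following sketch; the values are placeholders, and you only need the variables for the backend you actually use:

```
AZURE_OPENAI_ENDPOINT=https://your-resource-name.openai.azure.com/
AZURE_OPENAI_KEY=your-azure-api-key
OPENAI_API_KEY=your-openai-api-key
```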
To evaluate a model on a task, run the following command:
```bash
uv sync --group normal
uv run --group normal python examples/sample.py \
  --model_id llava-hf/llava-1.5-7b-hf \
  --task_id japanese-heron-bench \
  --result_dir result \
  --metrics heron-bench \
  --judge_model gpt-4o-2024-11-20 \
  --overwrite
```
The evaluation results will be saved in the result directory:
```
result
├── japanese-heron-bench
│   ├── llava-hf
│   │   ├── llava-1.5-7b-hf
│   │   │   ├── evaluation.jsonl
│   │   │   └── prediction.jsonl
```
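Both files are JSON Lines; the exact fields depend on the task and metric, so a generic loader is enough for a quick inspection. A minimal sketch (the `load_jsonl` helper and the path below are illustrative, not part of the library):

```python
import json
from pathlib import Path

# Illustrative helper: each line of prediction.jsonl / evaluation.jsonl is a
# JSON object, so we simply load the records as dicts and look at what is there.
def load_jsonl(path: Path) -> list[dict]:
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

result_dir = Path("result/japanese-heron-bench/llava-hf/llava-1.5-7b-hf")
predictions = load_jsonl(result_dir / "prediction.jsonl")
evaluation = load_jsonl(result_dir / "evaluation.jsonl")
print(len(predictions), "predictions")
print(evaluation[0])  # first record, with whatever fields the metric wrote
```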
To evaluate multiple models on multiple tasks, please check `eval_all.sh`.
You can integrate llm-jp-eval-mm into your own code. Here's an example:
```python
from PIL import Image
from eval_mm import TaskRegistry, ScorerRegistry, ScorerConfig

class MockVLM:
    def generate(self, images: list[Image.Image], text: str) -> str:
        return "宮崎駿"

task = TaskRegistry.load_task("japanese-heron-bench")
example = task.dataset[0]
input_text = task.doc_to_text(example)
images = task.doc_to_visual(example)
reference = task.doc_to_answer(example)

model = MockVLM()
prediction = model.generate(images, input_text)

scorer = ScorerRegistry.load_scorer(
    "rougel",
    ScorerConfig(docs=task.dataset),
)
result = scorer.aggregate(scorer.score([reference], [prediction]))
print(result)
# AggregateOutput(overall_score=5.128205128205128, details={'rougel': 5.128205128205128})
```
To generate a leaderboard from your evaluation results, run:
```bash
python scripts/make_leaderboard.py --result_dir result
```
This will create a `leaderboard.md` file with your model performance:
| Model | Heron/LLM | JVB-ItW/LLM | JVB-ItW/Rouge |
|---|---|---|---|
| llm-jp/llm-jp-3-vila-14b | 68.03 | 4.08 | 52.4 |
| Qwen/Qwen2.5-VL-7B-Instruct | 70.29 | 4.28 | 29.63 |
| google/gemma-3-27b-it | 69.15 | 4.36 | 30.89 |
| microsoft/Phi-4-multimodal-instruct | 45.52 | 3.2 | 26.8 |
| gpt-4o-2024-11-20 | 93.7 | 4.44 | 32.2 |
The official leaderboard is available here.
Japanese Tasks:
- Japanese Heron Bench
- JA-VG-VQA-500
- JA-VLM-Bench-In-the-Wild
- JA-Multi-Image-VQA
- JDocQA
- JMMMU
- JIC-VQA
- MECHA-ja
English Tasks:
- MMMU
We use uv’s dependency groups to manage each model’s dependencies.
For example, to use llm-jp/llm-jp-3-vila-14b, run:
```bash
uv sync --group vilaja
uv run --group vilaja python examples/VILA_ja.py
```
See `eval_all.sh` for the complete list of model dependencies.
When adding a new group, remember to configure `conflicts` in `pyproject.toml` so that mutually exclusive groups are never resolved into the same environment, as sketched below.
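A minimal sketch of such a declaration, assuming the `normal` and `vilaja` groups used above must not be installed together (adjust to your actual groups):

```toml
[tool.uv]
# These dependency groups must not be resolved into the same environment;
# pick the one you need with `uv sync --group <group_name>`.
conflicts = [
    [
        { group = "normal" },
        { group = "vilaja" },
    ],
]
```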
To browse model predictions with the Streamlit viewer, run:
```bash
uv run streamlit run scripts/browse_prediction.py -- --task_id japanese-heron-bench --result_dir result --model_list llava-hf/llava-1.5-7b-hf
```
To add a new task, implement the `Task` class in `src/eval_mm/tasks/task.py`.
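Conceptually, a task exposes a dataset plus the `doc_to_text`, `doc_to_visual`, and `doc_to_answer` accessors used in the API example above. The sketch below only illustrates that shape; check `src/eval_mm/tasks/task.py` for the actual base class, constructor, and registration mechanism:

```python
from PIL import Image

# Hypothetical task sketch: not the real Task base class, just the interface
# shape that the evaluation loop relies on.
class MyNewTask:
    def __init__(self):
        # Load or build your dataset here; each element is one example (doc).
        self.dataset = [
            {
                "question": "この画像には何が写っていますか?",
                "image": Image.new("RGB", (224, 224)),
                "answer": "猫",
            }
        ]

    def doc_to_text(self, doc) -> str:
        return doc["question"]

    def doc_to_visual(self, doc) -> list[Image.Image]:
        return [doc["image"]]

    def doc_to_answer(self, doc) -> str:
        return doc["answer"]
```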
To add a new metric, implement the `Scorer` class in `src/eval_mm/metrics/scorer.py`.
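A scorer pairs `score` (per-example scores for reference/prediction pairs) with `aggregate` (a single summary), matching how they are called in the API example. The exact base-class signature and `AggregateOutput` type live in `src/eval_mm/metrics/scorer.py`; the following is only a hedged sketch:

```python
# Hypothetical exact-match scorer; the real implementation should inherit
# from the Scorer base class and return an AggregateOutput from aggregate().
class ExactMatchScorer:
    def score(self, refs: list[str], preds: list[str]) -> list[int]:
        # One score per (reference, prediction) pair.
        return [int(r.strip() == p.strip()) for r, p in zip(refs, preds)]

    def aggregate(self, scores: list[int]) -> float:
        # Collapse per-example scores into a single overall value.
        return sum(scores) / len(scores) if scores else 0.0
```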
To add a new model, implement the `VLM` class in `examples/base_vlm.py`.
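The central method is `generate`, which takes a list of PIL images and a prompt and returns the model's answer as text, exactly as `MockVLM` does above. A placeholder sketch (the class name and constructor argument are illustrative; inherit from the actual `VLM` base class in `examples/base_vlm.py`):

```python
from PIL import Image

# Hypothetical model wrapper: replace the placeholder with real model loading
# and inference, and inherit from the VLM base class in examples/base_vlm.py.
class MyVLM:
    def __init__(self, model_id: str):
        self.model_id = model_id  # e.g. a Hugging Face model id

    def generate(self, images: list[Image.Image], text: str) -> str:
        # Run inference on (images, text) and return the generated answer.
        return f"dummy answer from {self.model_id}"
```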
Install a new dependency using the following command:
```bash
uv add <package_name>
uv add --group <group_name> <package_name>
```
Run the following commands to test tasks, metrics, and models:
```bash
bash test.sh
bash test_model.sh
```
Ensure code consistency with:
```bash
uv run ruff format src
uv run ruff check --fix src
```
To release a new version:
```bash
git tag -a v0.x.x -m "version 0.x.x"
git push origin --tags
```
For website updates, see github_pages/README.md.
To update leaderboard data:
```bash
python scripts/make_leaderboard.py --update_pages
```
- Heron: We refer to the Heron code for the evaluation of the Japanese Heron Bench task.
- lmms-eval: We refer to the lmms-eval code for the evaluation of the JMMMU and MMMU tasks.
We also thank the developers of the evaluation datasets for their hard work.