llm-jp-eval-mm is a lightweight framework for evaluating visual-language models across various benchmark tasks, mainly focusing on Japanese tasks.
You can install llm-jp-eval-mm from GitHub or via PyPI.
- Option 1: Clone from GitHub (Recommended)
```bash
git clone git@github.com:llm-jp/llm-jp-eval-mm.git
cd llm-jp-eval-mm
uv sync
```
- Option 2: Install via PyPI
```bash
pip install eval_mm
```
To use LLM-as-a-Judge, configure your OpenAI API keys in a `.env` file:
- For Azure: set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY`
- For OpenAI: set `OPENAI_API_KEY`
If you are not using LLM-as-a-Judge, you can assign any value to these variables in the `.env` file to bypass the error.
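For example, a minimal `.env` might look like the following sketch; the values are placeholders, and you only need the variables for the backend you actually use:

```
AZURE_OPENAI_ENDPOINT=https://your-resource-name.openai.azure.com/
AZURE_OPENAI_KEY=your-azure-api-key
OPENAI_API_KEY=your-openai-api-key
```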
To evaluate a model on a task, run the following command:
```bash
uv sync --group normal
uv run --group normal python examples/sample.py \
  --model_id llava-hf/llava-1.5-7b-hf \
  --task_id japanese-heron-bench \
  --result_dir result \
  --metrics heron-bench \
  --judge_model gpt-4o-2024-11-20 \
  --overwrite
```
The evaluation results will be saved in the result directory:
```
result
├── japanese-heron-bench
│   ├── llava-hf
│   │   ├── llava-1.5-7b-hf
│   │   │   ├── evaluation.jsonl
│   │   │   └── prediction.jsonl
```
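Both files are JSON Lines; the exact fields depend on the task and metric, so a generic loader is enough for a quick inspection. A minimal sketch (the `load_jsonl` helper and the path below are illustrative, not part of the library):

```python
import json
from pathlib import Path

# Illustrative helper: each line of prediction.jsonl / evaluation.jsonl is a
# JSON object, so we simply load the records as dicts and look at what is there.
def load_jsonl(path: Path) -> list[dict]:
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

result_dir = Path("result/japanese-heron-bench/llava-hf/llava-1.5-7b-hf")
predictions = load_jsonl(result_dir / "prediction.jsonl")
evaluation = load_jsonl(result_dir / "evaluation.jsonl")
print(len(predictions), "predictions")
print(evaluation[0])  # first record, with whatever fields the metric wrote
```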
To evaluate multiple models on multiple tasks, please check `eval_all.sh`.
You can integrate llm-jp-eval-mm into your own code. Here's an example:
```python
from PIL import Image
from eval_mm import TaskRegistry, ScorerRegistry, ScorerConfig

class MockVLM:
    def generate(self, images: list[Image.Image], text: str) -> str:
        return "宮崎駿"

task = TaskRegistry.load_task("japanese-heron-bench")
example = task.dataset[0]
input_text = task.doc_to_text(example)
images = task.doc_to_visual(example)
reference = task.doc_to_answer(example)

model = MockVLM()
prediction = model.generate(images, input_text)

scorer = ScorerRegistry.load_scorer(
    "rougel",
    ScorerConfig(docs=task.dataset),
)
result = scorer.aggregate(scorer.score([reference], [prediction]))
print(result)
# AggregateOutput(overall_score=5.128205128205128, details={'rougel': 5.128205128205128})
```
To generate a leaderboard from your evaluation results, run:
```bash
python scripts/make_leaderboard.py --result_dir result
```
This will create a `leaderboard.md` file with your model performance:
| Model | Heron/LLM | JVB-ItW/LLM | JVB-ItW/Rouge |
|---|---|---|---|
| llm-jp/llm-jp-3-vila-14b | 68.03 | 4.08 | 52.4 |
| Qwen/Qwen2.5-VL-7B-Instruct | 70.29 | 4.28 | 29.63 |
| google/gemma-3-27b-it | 69.15 | 4.36 | 30.89 |
| microsoft/Phi-4-multimodal-instruct | 45.52 | 3.2 | 26.8 |
| gpt-4o-2024-11-20 | 93.7 | 4.44 | 32.2 |
The official leaderboard is available here.
Japanese Tasks:
- Japanese Heron Bench
- JA-VG-VQA-500
- JA-VLM-Bench-In-the-Wild
- JA-Multi-Image-VQA
- JDocQA
- JMMMU
- JIC-VQA
- MECHA-ja
English Tasks:
- MMMU
We use uv’s dependency groups to manage each model’s dependencies.
For example, to use llm-jp/llm-jp-3-vila-14b, run:
```bash
uv sync --group vilaja
uv run --group vilaja python examples/VILA_ja.py
```
See `eval_all.sh` for the complete list of model dependencies.
When adding a new group, remember to configure `conflicts` in `pyproject.toml` so that mutually exclusive groups are never resolved into the same environment, as sketched below.
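A minimal sketch of such a declaration, assuming the `normal` and `vilaja` groups used above must not be installed together (adjust to your actual groups):

```toml
[tool.uv]
# These dependency groups must not be resolved into the same environment;
# pick the one you need with `uv sync --group <group_name>`.
conflicts = [
    [
        { group = "normal" },
        { group = "vilaja" },
    ],
]
```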
To browse model predictions with the Streamlit viewer, run:
```bash
uv run streamlit run scripts/browse_prediction.py -- --task_id japanese-heron-bench --result_dir result --model_list llava-hf/llava-1.5-7b-hf
```
To add a new task, implement the `Task` class in `src/eval_mm/tasks/task.py`.
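Conceptually, a task exposes a dataset plus the `doc_to_text`, `doc_to_visual`, and `doc_to_answer` accessors used in the API example above. The sketch below only illustrates that shape; check `src/eval_mm/tasks/task.py` for the actual base class, constructor, and registration mechanism:

```python
from PIL import Image

# Hypothetical task sketch: not the real Task base class, just the interface
# shape that the evaluation loop relies on.
class MyNewTask:
    def __init__(self):
        # Load or build your dataset here; each element is one example (doc).
        self.dataset = [
            {
                "question": "この画像には何が写っていますか?",
                "image": Image.new("RGB", (224, 224)),
                "answer": "猫",
            }
        ]

    def doc_to_text(self, doc) -> str:
        return doc["question"]

    def doc_to_visual(self, doc) -> list[Image.Image]:
        return [doc["image"]]

    def doc_to_answer(self, doc) -> str:
        return doc["answer"]
```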
To add a new metric, implement the `Scorer` class in `src/eval_mm/metrics/scorer.py`.
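A scorer pairs `score` (per-example scores for reference/prediction pairs) with `aggregate` (a single summary), matching how they are called in the API example. The exact base-class signature and `AggregateOutput` type live in `src/eval_mm/metrics/scorer.py`; the following is only a hedged sketch:

```python
# Hypothetical exact-match scorer; the real implementation should inherit
# from the Scorer base class and return an AggregateOutput from aggregate().
class ExactMatchScorer:
    def score(self, refs: list[str], preds: list[str]) -> list[int]:
        # One score per (reference, prediction) pair.
        return [int(r.strip() == p.strip()) for r, p in zip(refs, preds)]

    def aggregate(self, scores: list[int]) -> float:
        # Collapse per-example scores into a single overall value.
        return sum(scores) / len(scores) if scores else 0.0
```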
To add a new model, implement the `VLM` class in `examples/base_vlm.py`.
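The central method is `generate`, which takes a list of PIL images and a prompt and returns the model's answer as text, exactly as `MockVLM` does above. A placeholder sketch (the class name and constructor argument are illustrative; inherit from the actual `VLM` base class in `examples/base_vlm.py`):

```python
from PIL import Image

# Hypothetical model wrapper: replace the placeholder with real model loading
# and inference, and inherit from the VLM base class in examples/base_vlm.py.
class MyVLM:
    def __init__(self, model_id: str):
        self.model_id = model_id  # e.g. a Hugging Face model id

    def generate(self, images: list[Image.Image], text: str) -> str:
        # Run inference on (images, text) and return the generated answer.
        return f"dummy answer from {self.model_id}"
```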
Install a new dependency using the following command:
```bash
uv add <package_name>
uv add --group <group_name> <package_name>
```
Run the following commands to test tasks, metrics, and models:
```bash
bash test.sh
bash test_model.sh
```
Ensure code consistency with:
```bash
uv run ruff format src
uv run ruff check --fix src
```
To release a new version:
```bash
git tag -a v0.x.x -m "version 0.x.x"
git push origin --tags
```
For website updates, see github_pages/README.md.
To update leaderboard data:
```bash
python scripts/make_leaderboard.py --update_pages
```
- Heron: We refer to the Heron code for the evaluation of the Japanese Heron Bench task.
- lmms-eval: We refer to the lmms-eval code for the evaluation of the JMMMU and MMMU tasks.
We also thank the developers of the evaluation datasets for their hard work.