MiroMindAI/MiroEval: A benchmark and evaluation framework for deep research agents. 100 tasks (70 text, 30 multimodal) assessed across synthesis quality, factuality, and research process. 13 systems evaluated.

MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

MiroEval is an evaluation framework for Deep Research systems. It provides automated task generation and assesses systems along three complementary dimensions: factual correctness, point-wise quality, and process quality.

Benchmark results across Text-Only and Multimodal evaluations

Quick Start

1. Setup

All three evaluation modules share a single Python environment managed by uv at the repo root:

uv sync

If you use run_eval.sh, this step is done automatically on first run.

Then configure your API keys:

cp .env.template .env   # edit with your own keys

Required keys in .env:

Variable	Used By
`OPENAI_API_KEY` / `OPENAI_BASE_URL`	All modules (judge LLM)
`SERPER_API_KEY`	Factual eval (web search)
`JINA_API_KEY`	Factual eval (web reading)
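
Before a long run, it can help to confirm that every key in the table is set. The following is a minimal sketch, not part of MiroEval: the `missing_keys` helper and the grouping labels are hypothetical, and it reads the process environment, so it assumes the values from .env have already been exported or loaded.

```python
import os

# Variable names come from the table above. The grouping labels and the
# missing_keys() helper are illustrative and not part of MiroEval.
REQUIRED_KEYS = {
    "all modules (judge LLM)": ["OPENAI_API_KEY", "OPENAI_BASE_URL"],
    "factual eval (web search)": ["SERPER_API_KEY"],
    "factual eval (web reading)": ["JINA_API_KEY"],
}


def missing_keys(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [
        name
        for names in REQUIRED_KEYS.values()
        for name in names
        if not env.get(name)
    ]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing keys:", ", ".join(missing))
    else:
        print("All required keys are set.")
```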

2. Prepare Input

Save your model's results as a single JSON file containing an array. Each entry should follow this schema:

{
  "id": 1,
  "rewritten_query": "The evaluation query",
  "response": "Model-generated research report",
  "process": "Research process trace (for process eval)",
  "files": []
}

Entries are routed automatically: the first 70 (text-only, files: []) use the text factual-eval config, and the last 30 (multimodal, with files) use the multimodal config.
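
A quick pre-flight check of the input file can catch schema problems before any API calls are made. This is a rough sketch under stated assumptions: `check_entries` and `check_file` are hypothetical helpers, all five schema fields are treated as required, and an empty files list is taken to mean a text-only entry. How MiroEval itself decides the routing is not shown here.

```python
import json

# Field names follow the entry schema above.
REQUIRED_FIELDS = ("id", "rewritten_query", "response", "process", "files")


def check_entries(entries):
    """Return (problems, text_ids, multimodal_ids) for a list of result entries."""
    problems, text_ids, multimodal_ids = [], [], []
    for index, entry in enumerate(entries):
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        if missing:
            problems.append(f"entry {index}: missing {', '.join(missing)}")
            continue
        # Assumption: an empty "files" list marks a text-only entry.
        if entry["files"]:
            multimodal_ids.append(entry["id"])
        else:
            text_ids.append(entry["id"])
    return problems, text_ids, multimodal_ids


def check_file(path):
    """Load a JSON array file and run check_entries on it."""
    with open(path, encoding="utf-8") as handle:
        return check_entries(json.load(handle))
```

For the full benchmark, one would expect 70 text IDs, 30 multimodal IDs, and an empty problem list.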

3. Run Evaluation

# Run all three dimensions (auto-creates venv on first run)
bash run_eval.sh --input data/method_results/my_model.json --model_name my_model

# Run specific dimensions only
bash run_eval.sh --input results.json --model_name test --evaluations factual_eval point_quality

# Or call directly with the venv python
.venv/bin/python run_eval.py --input data/method_results/my_model.json --model_name my_model
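
When evaluating several models from a script, the same invocation can be assembled programmatically. A minimal sketch: `build_command` is a hypothetical helper, and it uses only the flags shown above (`--input`, `--model_name`, `--evaluations`).

```python
import shlex


def build_command(input_path, model_name, evaluations=None):
    """Build the argument list for run_eval.sh from the flags shown above."""
    cmd = ["bash", "run_eval.sh", "--input", input_path, "--model_name", model_name]
    if evaluations:
        # Dimension names are passed as separate arguments after the flag.
        cmd += ["--evaluations", *evaluations]
    return cmd


if __name__ == "__main__":
    cmd = build_command("results.json", "test", ["factual_eval", "point_quality"])
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repo root.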

4. Results

Combined results are saved to outputs/<model_name>_<timestamp>/results.json:

{
  "model_name": "my_model",
  "entries_count": 100,
  "factual_eval": {
    "avg_right_ratio": 0.825,
    "total_statements": 1500,
    "right": 1200, "wrong": 150, "unknown": 100, "conflict": 50,
    "per_entry": { ... }
  },
  "point_quality": {
    "average_total_score": 8.5,
    "dimension_averages": { "coverage_score": 8.5, "insight_score": 8.6, ... },
    "per_entry": { ... }
  },
  "process_eval": {
    "overall_avg": 8.17,
    "intrinsic_avg": 8.1,
    "alignment_avg": 8.23,
    "dimensions": { ... },
    "per_entry": { ... }
  }
}
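
To compare runs, the headline numbers can be pulled out of results.json with a few lines of Python. This is a sketch based only on the field names in the example above. The `headline_metrics` helper and the pooled right ratio are illustrative additions, and the assumption that a dimension's key is absent when it was not run is a guess.

```python
import json


def headline_metrics(results):
    """Collect headline numbers for whichever dimensions are present."""
    summary = {}
    factual = results.get("factual_eval")
    if factual:
        summary["factual_avg_right_ratio"] = factual["avg_right_ratio"]
        total = factual["total_statements"]
        # Pooled ratio over all statements (illustrative, not a MiroEval field).
        # It can differ from avg_right_ratio, which the name suggests is
        # averaged per entry.
        summary["factual_pooled_right_ratio"] = factual["right"] / total if total else None
    quality = results.get("point_quality")
    if quality:
        summary["point_quality_avg_total"] = quality["average_total_score"]
    process = results.get("process_eval")
    if process:
        summary["process_overall_avg"] = process["overall_avg"]
    return summary


def load_headline_metrics(path):
    """Read a results.json file and summarize it."""
    with open(path, encoding="utf-8") as handle:
        return headline_metrics(json.load(handle))
```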

Architecture

MiroEval/
├── pyproject.toml             # Root uv project (manages .venv for all modules)
├── .env                       # API keys (single configuration point)
├── run_eval.py                # Unified entry point (all three dimensions)
├── run_eval.sh                # Shell wrapper (auto-creates venv)
├── eval/                      # Evaluation orchestration layer
│   ├── config.py              # Path constants, env loading
│   └── adapters/              # Per-module adapters
├── data/                      # Shared data directory
│   ├── input_queries/         # Evaluation query sets + multimodal attachments
│   └── detail_results/        # Per-task per-model intermediate scores
├── factual_eval/              # Factual evaluation (MiroFlow-based fact-checking agent)
├── point_quality/             # Quality evaluation (adaptive point-wise scoring)
├── process_eval/              # Process evaluation (intrinsic process quality + report alignment)
└── task_generation/           # Evaluation task generation pipeline

Evaluation Dimensions

Dimension	Goal	Method	Key Metric	Details
Factual Eval	Report factual correctness	Agent + web search verification	Right Ratio	factual_eval/README.md
Point Quality	Report content quality	LLM multi-dimension scoring (0-10)	Weighted Total Score	point_quality/README.md
Process Eval	Research process quality	LLM structuring + scoring (1-10)	Overall Avg (intrinsic + alignment)	process_eval/README.md

For fine-grained single-dimension evaluation, see each module's README.

Data Format

Input Queries (`data/input_queries/`)

File / Directory	Description	Count
`mirobench_text.json`	Text-only query set	70
`mirobench_multimodal.json`	Multimodal query set (with image/document attachments)	30
`multimodal-attachments/`	Attachment files referenced by multimodal queries, organized by query ID (e.g., `72/`, `93/`). Contains images, PDFs, and other documents.	---

Text query schema:

{
  "id": 1,
  "chat_id": "uuid",
  "rewritten_query": "Expanded/rewritten query",
  "annotation": {
    "category": "text",
    "language": "zh | en",
    "pattern": "T1 | T2 | T5 | T6",
    "domain": "tech | finance | medical | ..."
  }
}

Multimodal query schema:

{
  "id": 71,
  "chat_id": "uuid",
  "rewritten_query": "Expanded/rewritten query",
  "files": [
    { "filename": "attachment_71_01.jpg", "type": "image", "dir": "multimodal-attachments/71/attachment_71_01.jpg", "size": "1.5 MB" }
  ],
  "annotation": { "category": "image | doc | multi_doc", "language": "zh | en" }
}
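
A system under test needs the attachment files themselves, not just the metadata. The sketch below resolves each dir value to a path on disk. It assumes that dir is relative to data/input_queries/, which matches the directory layout above but is not stated explicitly. `attachment_paths` is a hypothetical helper.

```python
from pathlib import Path

# Assumption: each "dir" value is relative to the input-queries directory.
QUERY_ROOT = Path("data/input_queries")


def attachment_paths(query, root=QUERY_ROOT):
    """Return the on-disk paths of a query's attachments (empty for text queries)."""
    return [Path(root) / item["dir"] for item in query.get("files", [])]


if __name__ == "__main__":
    example = {
        "id": 71,
        "files": [
            {
                "filename": "attachment_71_01.jpg",
                "type": "image",
                "dir": "multimodal-attachments/71/attachment_71_01.jpg",
            }
        ],
    }
    for path in attachment_paths(example):
        print(path, "exists" if path.exists() else "missing")
```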

Model Results

One JSON file per model, containing a JSON array of complete query-response pairs. Place your model's output file in data/method_results/ (text-only) or data/method_multimodal_results/ (multimodal).

Result entry schema:

{
  "id": 1,
  "rewritten_query": "Rewritten query",
  "response": "Model-generated research report",
  "process": "Research process trace",
  "files": [],
  "annotation": { ... }
}

The response field contains the model's final report. The process field contains the intermediate research process trace (needed for process eval). Multimodal entries list their attachments in the files field; text-only entries leave it empty.
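
One way to produce such a file is to copy the shared fields from each input query and attach the model's output. The following is a sketch only: `build_result_entry` and `write_results` are hypothetical helpers, and the report and trace strings are assumed to come from your own system.

```python
import json


def build_result_entry(query, response, process):
    """Combine an input query with a model's report and process trace."""
    entry = {
        "id": query["id"],
        "rewritten_query": query["rewritten_query"],
        "response": response,
        "process": process,
        # Text-only queries carry no attachments, so default to an empty list.
        "files": query.get("files", []),
    }
    if "annotation" in query:
        entry["annotation"] = query["annotation"]
    return entry


def write_results(entries, path):
    """Write the entries as a single JSON array, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(entries, handle, ensure_ascii=False, indent=2)
```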

Task Generation

Automated pipeline for generating high-quality deep-research evaluation queries. See task_generation/README.md for full details.

Citation

@misc{ye2026miroevalbenchmarkingmultimodaldeep,
      title={MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome},
      author={Fangda Ye and Yuxin Hu and Pengxiang Zhu and Yibo Li and Ziqi Jin and Yao Xiao and Yibo Wang and Lei Wang and Zhen Zhang and Lu Wang and Yue Deng and Bin Wang and Yifan Zhang and Liangcai Su and Xinyu Wang and He Zhao and Chen Wei and Qiang Ren and Bryan Hooi and An Bo and Shuicheng Yan and Lidong Bing},
      year={2026},
      eprint={2603.28407},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2603.28407},
}

License

Apache-2.0
