Gaia2 CLI is the container-based evaluation stack for the Gaia2 benchmark. It packages the Gaia2 app CLIs, launches agent runtimes behind a shared HTTP contract, grades tool use against the scenario oracle, and generates inspectable traces.
For the broader benchmark description and scenario concepts, see the
Gaia2 evaluation guide and the
scenario foundations.
- podman
- Python 3.12
- uv
- network access to Hugging Face and your chosen model provider
- ANTHROPIC_API_KEY if you want to use the ready-made Anthropic configs below
cp .env.example .env

The runner auto-loads .env from this directory. For the included quickstart
configs, set ANTHROPIC_API_KEY.
If outbound access to providers or Hugging Face is restricted, you can also set:
- GAIA2_PROXY_RELAY_URL
- GAIA2_CA_BUNDLE
- http_proxy / https_proxy or HTTP_PROXY / HTTPS_PROXY
- optional HF_TOKEN to reduce Hugging Face throttling or rate limits
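As a rough preflight check for the variables listed above, the following sketch parses `.env`-style text and reports which required names are missing and which optional ones are set. It is not the runner's actual loader: the function names, the simple `KEY=VALUE` parsing, and the rule that the process environment overrides `.env` are all assumptions made for illustration.

```python
# Variable names come from this README; everything else here is a hypothetical helper.
REQUIRED = ["ANTHROPIC_API_KEY"]
OPTIONAL = [
    "GAIA2_PROXY_RELAY_URL",
    "GAIA2_CA_BUNDLE",
    "http_proxy",
    "https_proxy",
    "HTTP_PROXY",
    "HTTPS_PROXY",
    "HF_TOKEN",
]


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of simple quoting.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def preflight(env_text: str, environ: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (missing required names, optional names that are set).

    Assumption: values from the process environment win over .env values.
    """
    merged = {**parse_env_text(env_text), **environ}
    missing = [name for name in REQUIRED if not merged.get(name)]
    optional_set = [name for name in OPTIONAL if merged.get(name)]
    return missing, optional_set
```

To check a real setup, call `preflight(Path(".env").read_text(), dict(os.environ))` from the directory that holds `.env`; an empty first list means the quickstart key is present.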
You can build all images with make all. Each target builds the shared base
image (gaia2-cli) plus the selected runtime. For a quickstart run, you only
need one image:
make gaia2-hermes
make gaia2-oc

- runner/examples/quickstart_hermes.toml if you want the same Anthropic search pass@1 setup on Hermes
- runner/examples/quickstart_openclaw.toml if you want the same Anthropic search pass@1 setup on OpenClaw
Both use claude-sonnet-4-6 for the agent and the judge, so one key is enough to
get started. The judge is swappable: the shipped quickstart configs keep it on
Anthropic Sonnet 4.6 for operational simplicity, and you can point it at a
smaller model to lower cost, though score behavior may shift. Separately, we
calibrated gpt-oss-120b with low reasoning as the reference judge
configuration.
# Hermes
uv run --project runner --python 3.12 gaia2-runner run-config \
--config runner/examples/quickstart_hermes.toml
# OpenClaw
uv run --project runner --python 3.12 gaia2-runner run-config \
--config runner/examples/quickstart_openclaw.toml

This downloads the public dataset meta-agents-research-environments/gaia2-cli automatically if needed and writes artifacts to the configured output_dir.
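If you want to queue several configs from a script, the documented `run-config` invocation can be wrapped as below. This is a minimal sketch: `run_config_command` and `run_all` are hypothetical helper names, and the only command-line arguments used are the ones shown in this README.

```python
import shlex
import subprocess


def run_config_command(config_path: str) -> list[str]:
    """Build the `gaia2-runner run-config` invocation shown above as an argv list."""
    return [
        "uv", "run", "--project", "runner", "--python", "3.12",
        "gaia2-runner", "run-config", "--config", config_path,
    ]


def run_all(config_paths: list[str], dry_run: bool = True) -> list[int]:
    """Run configs one after another and collect their exit codes.

    With dry_run=True the commands are only printed, never executed.
    """
    exit_codes: list[int] = []
    for path in config_paths:
        cmd = run_config_command(path)
        print(shlex.join(cmd))
        if dry_run:
            exit_codes.append(0)
        else:
            # check=False: keep going even if one run fails, and record its code.
            exit_codes.append(subprocess.run(cmd, check=False).returncode)
    return exit_codes
```

Calling `run_all([...], dry_run=False)` from the repository root runs the configs sequentially; keeping the dry-run default makes it easy to inspect the exact commands first.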
# Hermes
uv run --project runner --python 3.12 gaia2-runner serve \
--output-dir /tmp/gaia2_hermes_quickstart
# OpenClaw
uv run --project runner --python 3.12 gaia2-runner serve \
--output-dir /tmp/gaia2_openclaw_quickstart

The runner also writes a static index.html into the output directory, so you
can reopen finished runs later without keeping the server alive.
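Reopening a finished run then only requires pointing a browser at that file. The sketch below builds a `file://` URI for it with the standard library; `report_uri` is a hypothetical helper, and it assumes `index.html` sits directly inside the output directory, as described above.

```python
import webbrowser
from pathlib import Path


def report_uri(output_dir: str) -> str:
    """Return a file:// URI for the static index.html inside output_dir."""
    return (Path(output_dir).resolve() / "index.html").as_uri()


def open_report(output_dir: str) -> bool:
    """Open the static report in the default browser; True if a browser was launched."""
    return webbrowser.open(report_uri(output_dir))
```

For example, `open_report("/tmp/gaia2_hermes_quickstart")` would open the Hermes quickstart report without starting the `serve` command.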
| Image | Use it when |
|---|---|
| localhost/gaia2-hermes:latest | You want the Hermes runtime with the same Gaia2 tool surface and trace collection |
| localhost/gaia2-oc:latest | You want the OpenClaw runtime for OpenAI, Anthropic, Google, OpenRouter, or OpenAI-compatible endpoints |
| localhost/gaia2-oracle:latest | You want an oracle replay baseline or a fast runner and judge smoke test |
Other useful examples:
- runner/examples/hermes_opus_gaia2_pass1.toml for Hermes with direct Anthropic Opus 4.6 on all public benchmark splits, pass@1
- runner/examples/hermes_sonnet_gaia2_pass1.toml for Hermes with direct Anthropic Sonnet 4.6 on all public benchmark splits, pass@1
- runner/examples/hermes_google_gaia2_pass1.toml for Hermes with direct Google AI Studio Gemini 3.1 Pro Preview on all public benchmark splits, pass@1
- runner/examples/hermes_gpt54_gaia2_pass1.toml for Hermes with direct OpenAI GPT-5.4 on all public benchmark splits, pass@1
- runner/examples/openclaw_opus_gaia2_pass1.toml for OpenClaw with direct Anthropic Opus 4.6 on all public benchmark splits, pass@1
- runner/examples/openclaw_sonnet_gaia2_pass1.toml for OpenClaw with direct Anthropic Sonnet 4.6 on all public benchmark splits, pass@1
- runner/examples/openclaw_google_gaia2_pass1.toml for OpenClaw with direct Google AI Studio Gemini 3.1 Pro Preview on all public benchmark splits, pass@1
- runner/examples/openclaw_gpt54_gaia2_pass1.toml for OpenClaw with direct OpenAI GPT-5.4 on all public benchmark splits, pass@1
- runner/examples/template_hermes_openai_compat.toml as a generic Hermes template for custom OpenAI chat-completions-compatible endpoints
- runner/examples/template_openclaw_openai_compat.toml as a generic OpenClaw template for custom OpenAI chat-completions-compatible endpoints
The full-benchmark Hermes and OpenClaw configs above keep the judge on
Anthropic Sonnet 4.6 by default. Set the provider API key that matches the
agent config you want to run, plus ANTHROPIC_API_KEY for the judge.
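The key requirement above (agent provider key plus the judge's Anthropic key) can be sketched as a small lookup. Only `ANTHROPIC_API_KEY` is named in this README; the other environment variable names in the mapping are assumptions based on common provider conventions, so check the config you run for the names it actually expects.

```python
# Hypothetical provider -> environment variable mapping.
PROVIDER_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",  # named in this README
    "openai": "OPENAI_API_KEY",        # assumption, not confirmed by this README
    "google": "GOOGLE_API_KEY",        # assumption, not confirmed by this README
}

# The full-benchmark configs keep the judge on Anthropic Sonnet 4.6 by default.
JUDGE_KEY = "ANTHROPIC_API_KEY"


def required_keys(agent_provider: str) -> list[str]:
    """Return the agent key plus the judge key, without duplicates."""
    keys = [PROVIDER_KEYS[agent_provider], JUDGE_KEY]
    return list(dict.fromkeys(keys))  # dedupe while preserving order


def missing_keys(agent_provider: str, environ: dict[str, str]) -> list[str]:
    """Return the required key names that are unset or empty in environ."""
    return [name for name in required_keys(agent_provider) if not environ.get(name)]
```

With an Anthropic agent the two requirements collapse into a single key, which is why one key is enough for the quickstart configs; any other provider needs two.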
Run any edited config with:
uv run --project runner --python 3.12 gaia2-runner run-config \
--config runner/examples/<your-config>.toml

make help
make all
make verify
make test
uv run --project runner --python 3.12 gaia2-runner --help

gaia2-cli/
├── cli/ # Gaia2 app CLIs, daemon, in-container judge
├── core/ # Shared event-loop and judging primitives
├── runner/ # Host-side launcher, config loader, trace viewer
├── shared/ # Adapter base, exec wrapper, prompt rendering helpers
├── containers/ # OpenClaw, Hermes, and Oracle runtime images
├── scripts/ # Repo utilities such as dataset export
└── Makefile # Build, verify, and test entrypoints
- runner/README.md for the full runner workflow, config format, and CLI details
- runner/TRACE_FORMAT.md for the raw trace contract
- containers/openclaw/README.md for OpenClaw internals and debugging
- containers/hermes/README.md for Hermes internals and debugging
- containers/oracle/README.md for Oracle replay internals and debugging
See LICENSE in the repository root.
If you use Gaia2 CLI in your work, please cite:
@misc{froger2026gaia2benchmarkingllmagents,
title={Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments},
author={Romain Froger and Pierre Andrews and Matteo Bettini and Amar Budhiraja and Ricardo Silveira Cabral and Virginie Do and Emilien Garreau and Jean-Baptiste Gaya and Hugo Laurençon and Maxime Lecanu and
Kunal Malkan and Dheeraj Mekala and Pierre Ménard and Gerard Moreno-Torres Bertran and Ulyana Piterbarg and Mikhail Plekhanov and Mathieu Rita and Andrey Rusakov and Vladislav Vorotilov and Mengjue Wang and Ian Yu
and Amine Benhalloum and Grégoire Mialon and Thomas Scialom},
year={2026},
eprint={2602.11964},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.11964},
}