Agentic Benchmark Checklist: Establishing Best Practices in Building Rigorous AI Agent Benchmarks

📰 News

[2024-06-14] We have released a patch to fix an issue in $\tau$-bench such that a trivial do-nothing agent will no-longer achieve scores as high as 38%! Thanks to @pruksmhc for the PR.
[2024-06-14] We have released a case study on OSWorld, where we identified and patched a task validity issue in OSWorld that impacts 13/46 problems in OSWorld's chrome section! Thanks to @jacobmdsit for the PR.
[2024-05-15] UTBoost is accepted to ACL 2025! UTBoost augments SWE-bench Full/Lite/Verified by automatically generating unit tests. Feel free to check our code!
[2024-05-14] We released a website for ABC!

🗺️ Overview

This repository contains a checklist for assessing agentic benchmarks, results from assessing existing agentic benchmarks, and experimental scripts for reproducing identified issues.

agentic-benchmarks/
├── ABC.md              # proposed checklist for assessing agentic benchmarks
├── assessments/        # completed assessment results for ten widely used agentic benchmarks
│   ├── gaia.yaml
│   ├── swe-bench.yaml
│   ├── ....
├── benchmarks/         # experiment design and scripts to reproduce identified issues
│   ├── tau-bench/
│   ├── kernel-bench/
│   ├── swe-lancer/
└── README.md

🔍 Agentic Benchmark Checklist (ABC)

ABC is composed of three sections:

Outcome Validity: the success signal (e.g., tests or checks) truly indicates that the task has been completed
Task Validity: a task should be solvable if and only if the agent possesses the target capability
Benchmark Reporting: when guaranteeing outcome validity and task validity is particularly challenging or even impossible, benchmark developers should discuss such issues with quantitative evidence and provide guidelines on interpreting imperfect benchmarking results

We provide all items of ABC in ABC.md.

🚀 Newly Identified Issues

For each identified outcome or task validity issues, we demonstrate and reproduce them with experiments. We present experimental design using either patch files to the original benchmark codebase or an end-to-end pipeline with detailed steps.

1. Tau-Bench

Tau-Bench evaluates AI agents in environments like retail and airline operations. We highlight two key validity issues in Tau-Bench:

Issue 1: A trivial do-nothing agent scores 38% pass@k and pass^k (for any k).
Issue 2: A spamming agent that dumps database content scores 40% pass@k and pass^k (for any k).

Reproducing Issues

Issue 1: Apply exploit-1.patch and run the trivial do-nothing agent.
Issue 2: Apply exploit-2.patch and run the spamming agent.

Refer to benchmarks/tau-bench/README.md for detailed instructions.

2. Kernel-Bench

Kernel-Bench tests the correctness of CUDA kernel functions generated by AI agents. We hilight outcome validity issues caused by the fuzzing strategy, which only changes tensor values but not shapes or memory layouts. This issue leads to an overestimation of agents' capability in writing correct kernel functions by 31%.

Reproducing Issues

Follow the steps in benchmarks/kernel-bench/README.md to reproduce and analyze the issues.

3. SWE-Lancer

SWE-Lancer evaluates agents' ability to implement features and fix bugs. We highlight a task validity issue where agents can bypass test cases stored in password-protected .zip files.

Reproducing and Fixing the Issue

To address the issue, double-zip the test cases with passwords. See benchmarks/swe-lancer/README.md for details.

4. WebArena

WebArena evaluates agents' ability to interact with websites. We highlight outcome validity issues where agent pass the string-matching or LLM-as-a-Judge evaluation without resolving users' requests.

Reproducing the Issues

Follow the steps in benchmarks/webarena/README.md to reproduce and analyze the issues.

5. OSWorld

OSWorld evaluates agents' ability to interact with a computer. We highlight a task validity issue where an agent cannot pass the evaluation even though it has correctly completed the task.

Reproducing and Fixing the Issue

To fix the issue, revise the evaluation scripts according to the updated external websites. See benchmarks/osworld/README.md

Contributing

Contributions are welcome! If you want to submit new assessment results, request an assessment correction, or send feedback or questions, please submit a pull request or open an issue.

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
assessments		assessments
benchmarks		benchmarks
.gitignore		.gitignore
.gitmodules		.gitmodules
ABC.md		ABC.md
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Agentic Benchmark Checklist: Establishing Best Practices in Building Rigorous AI Agent Benchmarks

📰 News

🗺️ Overview

🔍 Agentic Benchmark Checklist (ABC)

🚀 Newly Identified Issues

1. Tau-Bench

Reproducing Issues

2. Kernel-Bench

Reproducing Issues

3. SWE-Lancer

Reproducing and Fixing the Issue

4. WebArena

Reproducing the Issues

5. OSWorld

Reproducing and Fixing the Issue

Contributing

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 4

Uh oh!

Languages

uiuc-kang-lab/agentic-benchmarks

Folders and files

Latest commit

History

Repository files navigation

Agentic Benchmark Checklist: Establishing Best Practices in Building Rigorous AI Agent Benchmarks

📰 News

🗺️ Overview

🔍 Agentic Benchmark Checklist (ABC)

🚀 Newly Identified Issues

1. Tau-Bench

Reproducing Issues

2. Kernel-Bench

Reproducing Issues

3. SWE-Lancer

Reproducing and Fixing the Issue

4. WebArena

Reproducing the Issues

5. OSWorld

Reproducing and Fixing the Issue

Contributing

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 4

Uh oh!

Languages

Packages