- [2024-06-14] We have released a patch to fix an issue in
$\tau$ -bench such that a trivial do-nothing agent will no-longer achieve scores as high as 38%! Thanks to @pruksmhc for the PR. - [2024-06-14] We have released a case study on OSWorld, where we identified and patched a task validity issue in OSWorld that impacts 13/46 problems in OSWorld's
chromesection! Thanks to @jacobmdsit for the PR. - [2024-05-15] UTBoost is accepted to ACL 2025! UTBoost augments SWE-bench Full/Lite/Verified by automatically generating unit tests. Feel free to check our code!
- [2024-05-14] We released a website for ABC!
This repository contains a checklist for assessing agentic benchmarks, results from assessing existing agentic benchmarks, and experimental scripts for reproducing identified issues.
agentic-benchmarks/
├── ABC.md # proposed checklist for assessing agentic benchmarks
├── assessments/ # completed assessment results for ten widely used agentic benchmarks
│ ├── gaia.yaml
│ ├── swe-bench.yaml
│ ├── ....
├── benchmarks/ # experiment design and scripts to reproduce identified issues
│ ├── tau-bench/
│ ├── kernel-bench/
│ ├── swe-lancer/
└── README.mdABC is composed of three sections:
- Outcome Validity: the success signal (e.g., tests or checks) truly indicates that the task has been completed
- Task Validity: a task should be solvable if and only if the agent possesses the target capability
- Benchmark Reporting: when guaranteeing outcome validity and task validity is particularly challenging or even impossible, benchmark developers should discuss such issues with quantitative evidence and provide guidelines on interpreting imperfect benchmarking results
We provide all items of ABC in ABC.md.
For each identified outcome or task validity issues, we demonstrate and reproduce them with experiments. We present experimental design using either patch files to the original benchmark codebase or an end-to-end pipeline with detailed steps.
Tau-Bench evaluates AI agents in environments like retail and airline operations. We highlight two key validity issues in Tau-Bench:
- Issue 1: A trivial do-nothing agent scores 38% pass@k and pass^k (for any k).
- Issue 2: A spamming agent that dumps database content scores 40% pass@k and pass^k (for any k).
- Issue 1: Apply
exploit-1.patchand run the trivial do-nothing agent. - Issue 2: Apply
exploit-2.patchand run the spamming agent.
Refer to benchmarks/tau-bench/README.md for detailed instructions.
Kernel-Bench tests the correctness of CUDA kernel functions generated by AI agents. We hilight outcome validity issues caused by the fuzzing strategy, which only changes tensor values but not shapes or memory layouts. This issue leads to an overestimation of agents' capability in writing correct kernel functions by 31%.
Follow the steps in benchmarks/kernel-bench/README.md to reproduce and analyze the issues.
SWE-Lancer evaluates agents' ability to implement features and fix bugs. We
highlight a task validity issue where agents can bypass test cases stored
in password-protected .zip files.
To address the issue, double-zip the test cases with passwords. See benchmarks/swe-lancer/README.md for details.
WebArena evaluates agents' ability to interact with websites. We highlight outcome validity issues where agent pass the string-matching or LLM-as-a-Judge evaluation without resolving users' requests.
Follow the steps in benchmarks/webarena/README.md to reproduce and analyze the issues.
OSWorld evaluates agents' ability to interact with a computer. We highlight a task validity issue where an agent cannot pass the evaluation even though it has correctly completed the task.
To fix the issue, revise the evaluation scripts according to the updated external websites. See benchmarks/osworld/README.md
Contributions are welcome! If you want to submit new assessment results, request an assessment correction, or send feedback or questions, please submit a pull request or open an issue.