Skip to content

uiuc-kang-lab/agentic-benchmarks

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

32 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Agentic Benchmark Checklist: Establishing Best Practices in Building Rigorous AI Agent Benchmarks

📰 News

  • [2024-06-14] We have released a patch to fix an issue in $\tau$-bench such that a trivial do-nothing agent will no-longer achieve scores as high as 38%! Thanks to @pruksmhc for the PR.
  • [2024-06-14] We have released a case study on OSWorld, where we identified and patched a task validity issue in OSWorld that impacts 13/46 problems in OSWorld's chrome section! Thanks to @jacobmdsit for the PR.
  • [2024-05-15] UTBoost is accepted to ACL 2025! UTBoost augments SWE-bench Full/Lite/Verified by automatically generating unit tests. Feel free to check our code!
  • [2024-05-14] We released a website for ABC!

🗺️ Overview

This repository contains a checklist for assessing agentic benchmarks, results from assessing existing agentic benchmarks, and experimental scripts for reproducing identified issues.

agentic-benchmarks/
├── ABC.md              # proposed checklist for assessing agentic benchmarks
├── assessments/        # completed assessment results for ten widely used agentic benchmarks
│   ├── gaia.yaml
│   ├── swe-bench.yaml
│   ├── ....
├── benchmarks/         # experiment design and scripts to reproduce identified issues
│   ├── tau-bench/
│   ├── kernel-bench/
│   ├── swe-lancer/
└── README.md

🔍 Agentic Benchmark Checklist (ABC)

ABC is composed of three sections:

  1. Outcome Validity: the success signal (e.g., tests or checks) truly indicates that the task has been completed
  2. Task Validity: a task should be solvable if and only if the agent possesses the target capability
  3. Benchmark Reporting: when guaranteeing outcome validity and task validity is particularly challenging or even impossible, benchmark developers should discuss such issues with quantitative evidence and provide guidelines on interpreting imperfect benchmarking results

We provide all items of ABC in ABC.md.

🚀 Newly Identified Issues

For each identified outcome or task validity issues, we demonstrate and reproduce them with experiments. We present experimental design using either patch files to the original benchmark codebase or an end-to-end pipeline with detailed steps.

1. Tau-Bench

Tau-Bench evaluates AI agents in environments like retail and airline operations. We highlight two key validity issues in Tau-Bench:

  • Issue 1: A trivial do-nothing agent scores 38% pass@k and pass^k (for any k).
  • Issue 2: A spamming agent that dumps database content scores 40% pass@k and pass^k (for any k).

Reproducing Issues

  • Issue 1: Apply exploit-1.patch and run the trivial do-nothing agent.
  • Issue 2: Apply exploit-2.patch and run the spamming agent.

Refer to benchmarks/tau-bench/README.md for detailed instructions.


2. Kernel-Bench

Kernel-Bench tests the correctness of CUDA kernel functions generated by AI agents. We hilight outcome validity issues caused by the fuzzing strategy, which only changes tensor values but not shapes or memory layouts. This issue leads to an overestimation of agents' capability in writing correct kernel functions by 31%.

Reproducing Issues

Follow the steps in benchmarks/kernel-bench/README.md to reproduce and analyze the issues.


3. SWE-Lancer

SWE-Lancer evaluates agents' ability to implement features and fix bugs. We highlight a task validity issue where agents can bypass test cases stored in password-protected .zip files.

Reproducing and Fixing the Issue

To address the issue, double-zip the test cases with passwords. See benchmarks/swe-lancer/README.md for details.


4. WebArena

WebArena evaluates agents' ability to interact with websites. We highlight outcome validity issues where agent pass the string-matching or LLM-as-a-Judge evaluation without resolving users' requests.

Reproducing the Issues

Follow the steps in benchmarks/webarena/README.md to reproduce and analyze the issues.


5. OSWorld

OSWorld evaluates agents' ability to interact with a computer. We highlight a task validity issue where an agent cannot pass the evaluation even though it has correctly completed the task.

Reproducing and Fixing the Issue

To fix the issue, revise the evaluation scripts according to the updated external websites. See benchmarks/osworld/README.md


Contributing

Contributions are welcome! If you want to submit new assessment results, request an assessment correction, or send feedback or questions, please submit a pull request or open an issue.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •