Computer-use agents (CUAs) are often unusable due to extreme end-to-end latency—taking tens of minutes for tasks humans complete in just a few. We present the first temporal performance study of computer-use agents on OSWorld and find that large model calls for planning and reflection dominate latency, with later steps taking up to 3× longer than earlier ones. To measure efficiency, we introduce OSWorld-Human, a manually annotated version of OSWorld with human reference trajectories. Evaluating 16 agents, we find even top performers take 1.4–2.7× more steps than necessary.
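The "1.4–2.7× more steps than necessary" figure comes from comparing an agent's trajectory length against the human reference trajectory for the same task. A minimal sketch of that per-task comparison is below; the function and field names are illustrative, not the benchmark's actual API.

```python
# Hypothetical sketch: per-task step efficiency relative to a human
# reference trajectory, in the spirit of OSWorld-Human's comparison.

def step_efficiency(agent_steps: int, human_steps: int) -> float:
    """Return how many times more steps the agent took than the human reference."""
    if human_steps <= 0:
        raise ValueError("human reference trajectory must have at least one step")
    return agent_steps / human_steps

# e.g. an agent taking 27 steps on a task a human finishes in 10
ratio = step_efficiency(27, 10)
print(f"{ratio:.1f}x the human step count")
```

Aggregating this ratio across tasks (and weighting by task success) is what distinguishes an efficiency-aware score from raw success rate alone.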
- July 07, 2025: OSWorld-Human blog post available on mlsys.wuklab.io.
- June 19, 2025: OSWorld-Human research paper available on arXiv.
- June 09, 2025: 🎉 Our paper has been accepted to the Workshop on Computer-Use Agents at ICML 2025! See you in Vancouver!
| Agent (Max Steps) | Original OSWorld (%) | Single-Action WES+ (%) | Grouped-Action WES+ (%) | WES- |
|---|---|---|---|---|
| UI-TARS-1.5 (100) | 42.5 | 23.7 | 14.3 | -0.22 |
| Agent S2 w/ Gemini 2.5 (50) | 41.4 | 28.2 | 17.4 | -0.26 |
| InfantAgent (50) | 35.3 | 13.3 | 8.2 | -0.22 |
| Agent S2 w/ Claude 3.7 (50) | 34.5 | 20.0 | 11.4 | -0.42 |
| UI-TARS-1.5 7B (100) | 26.9 | 12.4 | 7.9 | -0.33 |
| UI-TARS-72B-DPO (50) | 24.6 | 15.6 | 10.6 | -0.16 |
To compute your agent's score on OSWorld-Human, provide the path to the result directory generated by OSWorld and the maximum number of steps your agent could use:

```shell
python score.py --result-path /path/to/results/ --max-steps-scoring 50
```

To score the UI-TARS trajectories that have been submitted to OSWorld, add the `--uitars` flag to the command.
```bibtex
@misc{abhyankar2025osworldhumanbenchmarkingefficiencycomputeruse,
      title={OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents},
      author={Reyna Abhyankar and Qi Qi and Yiying Zhang},
      year={2025},
      eprint={2506.16042},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.16042},
}
```