
OSWorld-Human

| Research Paper | Blog Post |

Computer-use agents (CUAs) are often unusable due to extreme end-to-end latency—taking tens of minutes for tasks humans complete in just a few. We present the first temporal performance study of computer-use agents on OSWorld and find that large model calls for planning and reflection dominate latency, with later steps taking up to 3× longer than earlier ones. To measure efficiency, we introduce OSWorld-Human, a manually annotated version of OSWorld with human reference trajectories. Evaluating 16 agents, we find even top performers take 1.4–2.7× more steps than necessary.

News

🏆 Leaderboard (Updated 6/30)

| Agent (Max Steps) | Original OSWorld (%) | Single-Action WES+ (%) | Grouped-Action WES+ (%) | WES- |
|---|---|---|---|---|
| UI-TARS-1.5 (100) | 42.5 | 23.7 | 14.3 | -0.22 |
| Agent S2 w/ Gemini 2.5 (50) | 41.4 | 28.2 | 17.4 | -0.26 |
| InfantAgent (50) | 35.3 | 13.3 | 8.2 | -0.22 |
| Agent S2 w/ Claude 3.7 (50) | 34.5 | 20.0 | 11.4 | -0.42 |
| UI-TARS-1.5 7B (100) | 26.9 | 12.4 | 7.9 | -0.33 |
| UI-TARS-72B-DPO (50) | 24.6 | 15.6 | 10.6 | -0.16 |

Usage

To compute your agent's score on OSWorld-Human, provide the path to the result directory generated by OSWorld and the maximum number of steps your agent was allowed to take.

python score.py --result-path /path/to/results/ --max-steps-scoring 50
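
The exact scoring logic lives in score.py and is defined in the paper; as a rough illustration only, here is a minimal sketch of a WES+-style metric, assuming each task record carries a success flag, the agent's step count, and the human-annotated step count. The function name, input format, and the capped-ratio weighting are all hypothetical, not the repository's implementation.

```python
# Hypothetical sketch (not the repo's implementation): weight each solved
# task by how close the agent's step count is to the human reference.
def weighted_efficiency_score(results):
    """results: iterable of (solved: bool, agent_steps: int, human_steps: int)."""
    scores = []
    for solved, agent_steps, human_steps in results:
        if solved:
            # Credit for a solved task is scaled by step efficiency,
            # capped at 1.0 so undercutting the human reference is not over-rewarded.
            scores.append(min(1.0, human_steps / agent_steps))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)

# Example: one efficient success, one slow success, one failure.
print(weighted_efficiency_score([(True, 10, 8), (True, 30, 10), (False, 50, 12)]))
```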

If you would like to score the UI-TARS trajectories that have been submitted to OSWorld, add the --uitars flag to the command.
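
For example, to score those trajectories with a 100-step budget (matching the leaderboard entry for UI-TARS-1.5):

python score.py --result-path /path/to/results/ --max-steps-scoring 100 --uitars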

Citation

@misc{abhyankar2025osworldhumanbenchmarkingefficiencycomputeruse,
      title={OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents}, 
      author={Reyna Abhyankar and Qi Qi and Yiying Zhang},
      year={2025},
      eprint={2506.16042},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.16042}, 
}
