Computer-use agents (CUAs) are often unusable due to extreme end-to-end latency—taking tens of minutes for tasks humans complete in just a few. We present the first temporal performance study of computer-use agents on OSWorld and find that large model calls for planning and reflection dominate latency, with later steps taking up to 3× longer than earlier ones. To measure efficiency, we introduce OSWorld-Human, a manually annotated version of OSWorld with human reference trajectories. Evaluating 16 agents, we find even top performers take 1.4–2.7× more steps than necessary.
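The "1.4–2.7× more steps than necessary" figure comes from comparing an agent's trajectory length against the human reference trajectory for the same task. A minimal sketch of that per-task comparison is below; the function and field names are illustrative, not the benchmark's actual API.

```python
# Hypothetical sketch: per-task step efficiency relative to a human
# reference trajectory, in the spirit of OSWorld-Human's comparison.

def step_efficiency(agent_steps: int, human_steps: int) -> float:
    """Return how many times more steps the agent took than the human reference."""
    if human_steps <= 0:
        raise ValueError("human reference trajectory must have at least one step")
    return agent_steps / human_steps

# e.g. an agent taking 27 steps on a task a human finishes in 10
ratio = step_efficiency(27, 10)
print(f"{ratio:.1f}x the human step count")
```

Aggregating this ratio across tasks (and weighting by task success) is what distinguishes an efficiency-aware score from raw success rate alone.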
- July 07, 2025: OSWorld-Human blog post available on mlsys.wuklab.io.
- June 19, 2025: OSWorld-Human research paper available on arXiv.
- June 09, 2025: 🎉 Our paper has been accepted to the Workshop on Computer-Use Agents at ICML 2025! See you in Vancouver!
| Agent (Max Steps) | Original OSWorld (%) | Single-Action WES+ (%) | Grouped-Action WES+ (%) | WES- |
|---|---|---|---|---|
| UI-TARS-1.5 (100) | 42.5 | 23.7 | 14.3 | -0.22 |
| Agent S2 w/ Gemini 2.5 (50) | 41.4 | 28.2 | 17.4 | -0.26 |
| InfantAgent (50) | 35.3 | 13.3 | 8.2 | -0.22 |
| Agent S2 w/ Claude 3.7 (50) | 34.5 | 20.0 | 11.4 | -0.42 |
| UI-TARS-1.5 7B (100) | 26.9 | 12.4 | 7.9 | -0.33 |
| UI-TARS-72B-DPO (50) | 24.6 | 15.6 | 10.6 | -0.16 |
To compute your agent's score on OSWorld-Human, provide the path to the result directory generated by OSWorld and the maximum number of steps your agent could use:

```shell
python score.py --result-path /path/to/results/ --max-steps-scoring 50
```

To score the UI-TARS trajectories that have been submitted to OSWorld, add the `--uitars` flag to the command.
```bibtex
@misc{abhyankar2025osworldhumanbenchmarkingefficiencycomputeruse,
      title={OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents},
      author={Reyna Abhyankar and Qi Qi and Yiying Zhang},
      year={2025},
      eprint={2506.16042},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.16042},
}
```