OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents

arXiv:2506.16042v2 Announce Type: replace-cross Generative AI is being leveraged to solve a variety of computer-use tasks involving desktop applications. State-of-the-art systems have focused solely on improving accuracy on leading benchmarks. However, these systems are practically unusable due to extremely high end-to-end latency (e.g., tens of minutes) for tasks that typically take humans just a few minutes to complete.