AI RESEARCH

Computer Use at the Edge of the Statistical Precipice

arXiv CS.AI

arXiv:2605.08261v1 Announce Type: cross

Evaluating Computer Use Agents (CUAs) in interactive environments is fraught with methodological pitfalls that the field has yet to address systematically. We show that a 1 MB replay script, which blindly executes a recorded action sequence without ever observing the screen, outperforms frontier models on prominent static benchmarks, and we prove that in deterministic environments its expected success rate exactly equals the source agent's pass rate.
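
The intuition behind the replay result can be illustrated with a minimal sketch. This is not the paper's script or proof: `ToyEnv`, `run_source_agent`, and `replay` are hypothetical names invented here, and the counter task merely stands in for a deterministic benchmark. The point it shows is that when the environment is deterministic, re-executing a recorded action sequence reproduces the original outcome, so blind replay over a set of recorded episodes succeeds exactly as often as the agent that produced them.

```python
import random


class ToyEnv:
    """Hypothetical deterministic environment: a counter that must reach `target`."""

    def __init__(self, target=3, max_steps=6):
        self.target = target
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        self.state = 0
        self.steps = 0
        return self.state

    def step(self, action):
        # Transitions depend only on the current state and the action.
        self.state += 1 if action == "inc" else -1
        self.steps += 1
        done = self.state == self.target or self.steps >= self.max_steps
        return self.state, done

    def success(self):
        return self.state == self.target


def run_source_agent(env, rng):
    """Stochastic agent that observes the state; returns (actions, success)."""
    obs = env.reset()
    actions, done = [], False
    while not done:
        # Usually moves toward the target, but errs 30% of the time.
        good = "inc" if obs < env.target else "dec"
        bad = "dec" if good == "inc" else "inc"
        action = good if rng.random() < 0.7 else bad
        actions.append(action)
        obs, done = env.step(action)
    return actions, env.success()


def replay(env, actions):
    """Blind replay: executes recorded actions and never reads the observation."""
    env.reset()
    for action in actions:
        _, done = env.step(action)
        if done:
            break
    return env.success()


rng = random.Random(0)
env = ToyEnv()
episodes = [run_source_agent(env, rng) for _ in range(1000)]

source_rate = sum(ok for _, ok in episodes) / len(episodes)
replay_rate = sum(replay(env, acts) for acts, _ in episodes) / len(episodes)

print(f"source agent success rate: {source_rate:.3f}")
print(f"blind replay success rate: {replay_rate:.3f}")
```

The two printed rates are identical because every replayed episode ends in the same state as the recording. If the environment's transitions were randomized between runs, that equality would no longer be guaranteed.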