CocoaBench: Evaluating Unified Digital Agents in the Wild

arXiv:2604.11201v1 Announce Type: cross LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet, most evaluations still test these capabilities in isolation, which leaves a gap for diverse use cases that require agents to combine different capabilities. We