OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models

arXiv:2604.10866v1 Announce Type: new AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We