ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

arXiv:2604.05172v1. Large language model (LLM) agents are increasingly deployed to automate productivity tasks (e.g., email, scheduling, document management), but evaluating them on live services is risky because their actions can cause irreversible changes. Existing benchmarks rely on simplified environments and fail to capture realistic, stateful, multi-service workflows. We