CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments

arXiv:2603.28569v1 Announce Type: cross The increasing agentic capabilities of Large Language Models (LLMs) have enabled their deployment in real-world applications, such as cloud services, where customer-assistant interactions exhibit high technical complexity and long-horizon dependencies, making robustness and resolution efficiency critical for customer satisfaction.