Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI

arXiv:2603.14987v1 Announce Type: new As agentic AI systems move beyond static question answering into open-ended, tool-augmented, and multi-step real-world workflows, their increased authority poses greater risks of system misuse and operational failures. However, current evaluation practices remain fragmented, measuring isolated capabilities such as coding, hallucination, jailbreak resistance, or tool use in narrowly defined settings.