ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

ArXi:2604.02022v2 Announce Type: replace Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism. We