AI RESEARCH
Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
arXiv CS.AI
•
ArXi:2605.06161v1 Announce Type: new LLM-as-a-Judge pipelines have become the de facto evaluator for agent safety, yet existing benchmarks treat their verdicts as ground-truth proxies without checking whether the verdicts depend on the agent's behavior or merely on how the evaluation policy happens to be worded.