A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness

arXiv:2603.06594v1. Automated "LLM-as-a-Judge" frameworks have become the de facto standard for scalable evaluation across natural language processing. In safety evaluation, for instance, these judges are relied upon to assess the harmfulness of model outputs in order to benchmark how robust safety measures are against adversarial attacks.