CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems

arXiv:2604.12312v1

As Large Language Models (LLMs) are increasingly deployed as task-oriented agents in enterprise environments, ensuring their strict adherence to complex, domain-specific operational guidelines is critical. While using an LLM-as-a-Judge is a promising approach to scalable evaluation, the reliability of these judges in detecting specific policy violations remains largely unexplored.