AI RESEARCH

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

arXiv CS.AI

arXiv:2604.15302v1. LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood.