AI RESEARCH
Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
arXiv CS.AI
arXiv:2604.15302v1 Announce Type: new LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood.