Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models

arXiv:2603.22214v1 Announce Type: cross A Large Language Model (LLM) as judge evaluates the quality of victim Machine Learning (ML) models, specifically LLMs, by analyzing their outputs. An LLM as judge is the combination of one model and one specifically engineered judge prompt that contains the criteria for the analysis. Automating the analysis in this way scales up the complex evaluation of the victim models' free-form text outputs, delivering faster and more consistent judgments than human reviewers.
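
The pairing of one model with one judge prompt can be made concrete with a minimal sketch. Everything named here is an assumption for illustration, not the setup used in the paper: the `LLMJudge` class, the prompt wording, the 1 to 5 scale, the `Score: <n>` reply format, and the `stub_model` function that stands in for a real LLM call.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical judge prompt: the criteria and the 1-5 scale are illustrative
# assumptions, not taken from the paper.
JUDGE_PROMPT = (
    "You are an evaluator. Rate the response below for factual accuracy "
    "and clarity on a scale from 1 to 5.\n"
    "Response:\n{output}\n"
    "Answer with 'Score: <n>'."
)


@dataclass
class LLMJudge:
    """One model plus one judge prompt, as in the definition above."""

    model: Callable[[str], str]  # any function mapping prompt text to reply text
    prompt_template: str

    def judge(self, victim_output: str) -> int:
        # Embed the victim model's free-form output into the judge prompt.
        prompt = self.prompt_template.format(output=victim_output)
        reply = self.model(prompt)
        # Parse the verdict; reject replies that do not follow the format.
        match = re.search(r"Score:\s*([1-5])\b", reply)
        if match is None:
            raise ValueError(f"unparseable judge reply: {reply!r}")
        return int(match.group(1))


def stub_model(prompt: str) -> str:
    # Stand-in for a real LLM call (assumption): always returns the same verdict.
    return "Score: 4"


judge = LLMJudge(model=stub_model, prompt_template=JUDGE_PROMPT)
victim_outputs = ["Paris is the capital of France.", "2 + 2 = 5"]
scores = [judge.judge(text) for text in victim_outputs]
```

Because the stub ignores its input, both outputs receive the same score; swapping in a real model changes only the `model` callable. Keeping the prompt as a separate field makes it possible to vary the judge prompt while holding the model fixed, which is one way to probe how consistent the judgments are.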