Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

arXiv:2605.16354v1

Large language models (LLMs) are increasingly used as automated evaluators of AI systems, including in high-stakes applications. In this role, they generate judgments about the quality, appropriateness, or even safety of model outputs. This approach is motivated by practical constraints: expert human ratings are costly and difficult to scale, whereas LLM ratings can be produced quickly and at low cost.