AI RESEARCH

RubricEval: A Rubric-Level Meta-Evaluation Benchmark for LLM Judges in Instruction Following

arXiv cs.AI

arXiv:2603.25133v1

Rubric-based evaluation has become a prevailing paradigm for evaluating instruction following in large language models (LLMs). Despite its widespread use, the reliability of these rubric-level evaluations remains unclear, which calls for meta-evaluation. However, prior meta-evaluation efforts largely focus on the response level and so fail to assess the fine-grained judgment accuracy that rubric-based evaluation relies on. To bridge this gap, we