Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews

arXiv:2604.19502v1

The rapid adoption of Large Language Models (LLMs) has spurred interest in automated peer review; however, progress is currently stifled by benchmarks that treat reviewing primarily as a rating prediction task. We argue that the utility of a review lies in its textual justification (its arguments, questions, and critique) rather than in a scalar score. To address this, we