CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers

arXiv:2605.07905v1 Announce Type: cross

Despite the rapid development of AI reviewers, evaluating such systems remains challenging: existing metrics reward overlap with human reviews rather than correctness. Yet human reviews often cover only a subset of salient issues and sometimes contain mistakes, so they are unreliable as gold references. To address this and to strengthen Completeness, we build category-specific benchmark subsets and skip evaluation for a category when the corresponding human reviews are missing.
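
The category-specific evaluation with skipping can be illustrated by a minimal sketch. It is not the benchmark's actual implementation: the data layout (paper ID mapped to category mapped to a set of issue IDs), the category names, the function name `category_recall`, and the assumption that AI-raised and human-raised issues have already been matched to shared issue IDs are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def category_recall(human, ai):
    """Per-category recall of AI-raised issues against human-raised issues.

    Both arguments map paper_id -> {category: set of issue ids}.
    This layout is an assumption made for the sketch.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for paper_id, human_by_cat in human.items():
        ai_by_cat = ai.get(paper_id, {})
        for category, human_issues in human_by_cat.items():
            if not human_issues:
                # No human reference for this category on this paper:
                # skip it rather than scoring the AI reviewer against nothing.
                continue
            found = ai_by_cat.get(category, set())
            hits[category] += len(human_issues & found)
            totals[category] += len(human_issues)
    # Only categories with at least one human reference appear in the result.
    return {category: hits[category] / totals[category] for category in totals}


# Hypothetical toy data: two papers, two categories.
human = {
    "p1": {"novelty": {"n1", "n2"}, "clarity": set()},
    "p2": {"novelty": {"n3"}},
}
ai = {
    "p1": {"novelty": {"n1"}, "clarity": {"c9"}},
    "p2": {"novelty": {"n3"}},
}
scores = category_recall(human, ai)
print(scores)
```

In this toy input the AI reviewer recovers two of the three human-raised novelty issues, and the clarity category is left out of the result because no human review covers it, so the AI reviewer is neither rewarded nor penalized there.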