CR-Bench: Evaluating the Real-World Utility of AI Code Review Agents

arXiv:2603.11078v1 Announce Type: cross Recent advances in frontier large language models have enabled code review agents that operate in open-ended, reasoning-intensive settings. However, the lack of standardized benchmarks and granular evaluation protocols makes it difficult to assess the behavior of code review agents beyond coarse success metrics, particularly for tasks where false positives are costly. To address this gap, we