Scaling Evaluation-time Compute with Reasoning Models as Evaluators

arXiv:2503.19877v2 Announce Type: replace As language model (LM) outputs get more and more natural, it is becoming more difficult than ever to evaluate their quality. Simultaneously, increasing LMs' "thinking" time through scaling test-time compute has proven an effective technique to solve challenging problems in domains such as math and code. This raises a natural question: can an LM's evaluation capability also be improved by spending more test-time compute? To answer this, we investigate employing reasoning models (LMs that natively generate long chain-of-thought reasoning) as evaluators.