Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers

arXiv:2505.04842v2 Announce Type: replace-cross Prevalent reinforcement learning (RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value function for verification. Yet if parallel test-time compute is already part of the deployment plan