Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling

arXiv:2510.13918v2 Announce Type: replace

Process reward models (PRMs) are a cornerstone of test-time scaling (TTS), designed to verify and select the best responses from large language models (LLMs). However, this promise is challenged by recent benchmarks in which simple majority voting, which ignores PRM signals, occasionally outperforms standard PRM-based selection.
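
To make the two baselines concrete, the following is a minimal sketch of majority voting versus PRM-based best-of-N selection. It is an illustration only, not the paper's method: the function names `majority_vote` and `prm_best_of_n`, the sample answers, and the PRM scores are all hypothetical, and a real PRM would score intermediate reasoning steps rather than supply one number per response.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer; PRM scores are ignored."""
    return Counter(answers).most_common(1)[0][0]


def prm_best_of_n(answers, prm_scores):
    """Return the answer of the single response with the highest PRM score."""
    best_index = max(range(len(answers)), key=lambda i: prm_scores[i])
    return answers[best_index]


# Hypothetical final answers and PRM scores for five sampled responses.
answers = ["42", "42", "17", "42", "17"]
prm_scores = [0.55, 0.60, 0.90, 0.50, 0.30]

print(majority_vote(answers))
print(prm_best_of_n(answers, prm_scores))
```

In this made-up case the two rules disagree: majority voting returns the answer given by three of the five responses, while best-of-N follows the one response the PRM scored highest. If that high score were a PRM error, majority voting would be the better choice, which is the kind of outcome the benchmarks above report.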