From Curiosity to Caution: Mitigating Reward Hacking for Best-of-N with Pessimism

arXiv:2604.04648v1 Announce Type: new

Inference-time compute scaling has emerged as a powerful paradigm for improving language model performance on a wide range of tasks, but the question of how best to use the additional compute remains open. A popular approach is Best-of-N (BoN) sampling, where N candidate responses are generated, each is scored by a reward model, and the highest-scoring response is selected.
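
The selection step of BoN sampling described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `best_of_n`, `generate`, and `reward` are hypothetical names, and the toy generator and length-based reward used in the demo are stand-ins for a real language model and a learned reward model.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],       # hypothetical sampler: prompt -> response
    reward: Callable[[str, str], float],  # hypothetical reward model: (prompt, response) -> score
    n: int,
) -> Tuple[str, float]:
    """Draw n candidate responses and return the highest-scoring one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Sample n candidates independently from the generator.
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    # Score every candidate with the reward model.
    scores: List[float] = [reward(prompt, c) for c in candidates]
    # Pick the index of the maximum score (the first one wins on ties).
    best_index = max(range(n), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]


if __name__ == "__main__":
    # Toy stand-ins (assumptions for illustration only).
    rng = random.Random(0)
    pool = ["short", "a longer answer", "the longest answer of all"]

    def toy_generate(prompt: str) -> str:
        return rng.choice(pool)

    def toy_reward(prompt: str, response: str) -> float:
        # A deliberately naive proxy reward: longer responses score higher.
        return float(len(response))

    print(best_of_n("example prompt", toy_generate, toy_reward, n=4))
```

Because the selected response is whichever one maximizes the reward model's score, any systematic error in that score, such as the length preference of the toy reward here, is what the selection ends up favoring as N grows. That is the reward hacking concern named in the title.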