Best-of-Tails: Bridging Optimism and Pessimism in Inference-Time Alignment

arXiv:2603.06797v1 Announce Type: cross Inference-time alignment effectively steers large language models (LLMs) by generating multiple candidates from a reference model and selecting among them with an imperfect reward model. However, current strategies face a fundamental dilemma: ``optimistic'' approaches like Best-of-$N$ suffer from reward hacking, while ``pessimistic'' regularized methods often stifle the exploration needed to discover high-quality responses.
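
For readers unfamiliar with the baseline, the following is a minimal sketch of plain Best-of-$N$ selection as described above: draw $N$ candidates from a reference model and keep the one the reward model scores highest. It is not the method proposed in this work. The names `best_of_n`, `sample`, and `reward` are hypothetical placeholders for illustration, and the callables stand in for a real reference model and a real (imperfect) reward model.

```python
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    sample: Callable[[str], str],         # hypothetical: draws one response from the reference model
    reward: Callable[[str, str], float],  # hypothetical: imperfect proxy reward for (prompt, response)
    n: int,
) -> Tuple[str, float]:
    """Return the candidate with the highest proxy reward among n samples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Generate n candidates from the reference model.
    candidates = [sample(prompt) for _ in range(n)]
    # Score every candidate with the proxy reward model.
    scores = [reward(prompt, c) for c in candidates]
    # "Optimistic" selection: trust the proxy and take its argmax.
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]
```

Because the selection takes the argmax of the proxy score, any error in the reward model that inflates a candidate's score is more likely to be selected as $N$ grows, which is the reward-hacking failure mode the abstract refers to.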