PiCSAR: Probabilistic Confidence Selection And Ranking for Reasoning Chains

arXiv:2508.21787v2 Announce Type: replace-cross

Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple