Process Supervision of Confidence Margin for Calibrated LLM Reasoning

arXiv:2604.23333v1 Announce Type: new Scaling test-time computation with reinforcement learning (RL) has emerged as a reliable path to improving the reasoning ability of large language models (LLMs). Yet outcome-based rewards often incentivize models to be overconfident, leading to hallucinations, unreliable confidence-based control, and unnecessary compute allocation. We