Learning to Correct: Calibrated Reinforcement Learning for Multi-Attempt Chain-of-Thought

arXiv:2604.17912v1 Announce Type: new

State-of-the-art reasoning models use long chain-of-thought (CoT) to solve increasingly complex problems with test-time computation. In this work, we explore a long-CoT setting in which the model makes up to K successive attempts at solving a problem, and each attempt may build on earlier ones after the model receives hard verifier feedback. This motivates RL methods that can harness per-attempt rewards by carefully weighting individual attempts.
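The multi-attempt setting can be illustrated with a minimal sketch. It is not the paper's implementation. The names `run_multi_attempt`, `generate`, `verify`, and the toy helpers are hypothetical. The sketch assumes that "hard" verifier feedback is a binary correct/incorrect signal and that the episode stops at the first verified attempt; the abstract confirms neither detail. The weighting of per-attempt rewards is left out, since the abstract does not specify it.

```python
from typing import Callable, List, Tuple

# (attempt text, verifier verdict) pairs visible to later attempts.
History = List[Tuple[str, bool]]


def run_multi_attempt(
    problem: str,
    generate: Callable[[str, History], str],
    verify: Callable[[str, str], bool],
    max_attempts: int,
) -> List[Tuple[str, float]]:
    """Run up to max_attempts attempts and record a reward for each one.

    Assumption: feedback is binary, and the loop stops on the first success.
    """
    history: History = []
    trajectory: List[Tuple[str, float]] = []
    for _ in range(max_attempts):
        # Each attempt can condition on all earlier attempts and verdicts.
        attempt = generate(problem, history)
        correct = verify(problem, attempt)
        trajectory.append((attempt, 1.0 if correct else 0.0))
        history.append((attempt, correct))
        if correct:
            break
    return trajectory


# Toy stand-ins for a model and a verifier (hypothetical, for illustration).
def toy_generate(problem: str, history: History) -> str:
    guesses = ["3", "5", "4"]
    return guesses[min(len(history), len(guesses) - 1)]


def toy_verify(problem: str, attempt: str) -> bool:
    return attempt == "4"


if __name__ == "__main__":
    print(run_multi_attempt("2 + 2 = ?", toy_generate, toy_verify, max_attempts=3))
```

The returned list pairs each attempt with its own reward, which is the per-attempt signal an RL method in this setting would then have to weight.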