Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty

arXiv:2507.16806v2 Announce Type: replace-cross When language models (LMs) are trained via reinforcement learning (RL) to generate natural language "reasoning chains", their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs.