AI RESEARCH

The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

arXiv cs.LG

arXiv:2605.08666v1 Announce Type: new

A commonly accepted explanation of critic-free RL for LLMs, based on sequence-level rewards, is that it reinforces successful rollouts with a positive advantage and penalizes failed ones. In contrast, we study critic-free RL from a token-level perspective and reveal the token-flipping phenomenon: positive and negative rollouts exhibit remarkably similar proportions of tokens whose probabilities are boosted or suppressed during RL
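
To make the token-level measurement concrete, the following is a minimal sketch of how one might compare the share of boosted tokens in positive and negative rollouts. It rests on several assumptions that do not come from the abstract: the advantage is taken as reward minus the group mean (the abstract does not name an estimator), the helper names `group_advantages` and `boosted_fraction` are hypothetical, and the per-token log-probabilities are invented toy numbers, not results from the paper.

```python
def group_advantages(rewards):
    """Sequence-level advantage without a critic: reward minus the group mean.

    Assumption: a mean baseline is used here only for illustration.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def boosted_fraction(logp_before, logp_after):
    """Fraction of tokens whose log-probability increased after the update."""
    boosted = sum(1 for b, a in zip(logp_before, logp_after) if a > b)
    return boosted / len(logp_before)


# Hypothetical toy data: per-token log-probs before and after one RL update.
rollouts = [
    {"reward": 1.0,
     "before": [-1.2, -0.7, -2.0, -0.3],
     "after":  [-1.0, -0.9, -1.5, -0.4]},
    {"reward": 0.0,
     "before": [-0.8, -1.5, -0.6, -2.2],
     "after":  [-0.9, -1.3, -0.7, -2.0]},
]

advs = group_advantages([r["reward"] for r in rollouts])

# Label each rollout by the sign of its advantage and record its boosted share.
report = []
for rollout, adv in zip(rollouts, advs):
    label = "positive" if adv > 0 else "negative"
    report.append((label, boosted_fraction(rollout["before"], rollout["after"])))

for label, frac in report:
    print(f"{label} rollout: {frac:.0%} of tokens boosted")
```

In this toy setup both rollouts have the same share of boosted tokens even though their sequence-level advantages have opposite signs, which is the kind of pattern the abstract calls token flipping. The numbers are constructed to show the bookkeeping only and say nothing about how often the effect occurs in practice.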