STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

arXiv:2602.15620v4 Announce Type: replace-cross Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable