Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training

arXiv:2605.07316v1 Announce Type: new Reinforcement learning with verifiable rewards improves LLM reasoning but often induces overthinking, where models generate unnecessarily long reasoning traces. Existing methods mainly rely on length penalties or early-exit strategies; however, the former may degrade accuracy and induce underthinking, whereas the latter assumes that substantial portions of reasoning traces can be safely truncated. To obtain a compression signal without these limitations, we revisit the.