Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity

arXiv:2512.05962v2 Announce Type: replace-cross Reinforcement Learning (RL) has become the de facto standard for tuning LLMs to solve tasks involving reasoning. However, growing evidence shows that models trained in this way often suffer from a significant loss in diversity. We argue that this arises because RL implicitly optimizes the "mode-seeking" or "zero-forcing" Reverse KL to a target distribution, causing the model to concentrate mass on certain high-probability regions of the target while neglecting others.
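
The mode-seeking behavior of the Reverse KL can be seen in a toy setting unrelated to LLMs. The sketch below is an illustration only, not the paper's method: it assumes a bimodal target on a 1-D grid and a single-Gaussian model family with fixed width, and it uses a plain grid search over the mean. All names (`gaussian`, `kl`, `candidate_means`) are hypothetical.

```python
import numpy as np


def gaussian(x, mu, sigma):
    """Discretized Gaussian on grid x, normalized to sum to 1."""
    w = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return w / w.sum()


def kl(a, b, eps=1e-300):
    """Discrete KL(a || b); eps guards against log(0)."""
    return float(np.sum(a * (np.log(a + eps) - np.log(b + eps))))


# Assumed toy target: equal-weight mixture with modes at -2 and +2.
x = np.linspace(-6.0, 6.0, 1201)
target = 0.5 * gaussian(x, -2.0, 0.5) + 0.5 * gaussian(x, 2.0, 0.5)

# Model family: one Gaussian with fixed width; only the mean is fitted.
candidate_means = np.linspace(-4.0, 4.0, 161)

# Reverse KL: KL(model || target), the "mode-seeking" direction.
reverse_kl = [kl(gaussian(x, m, 0.5), target) for m in candidate_means]
# Forward KL: KL(target || model), the "mass-covering" direction.
forward_kl = [kl(target, gaussian(x, m, 0.5)) for m in candidate_means]

mu_reverse = candidate_means[int(np.argmin(reverse_kl))]
mu_forward = candidate_means[int(np.argmin(forward_kl))]

print("reverse-KL best mean:", mu_reverse)
print("forward-KL best mean:", mu_forward)
```

Under these assumptions, the Reverse KL fit should settle on one of the two modes and ignore the other, while the Forward KL fit should sit between them to cover both. This mirrors, in miniature, the concentration of mass that the abstract attributes to RL.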