Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions

arXiv:2506.07527v3 Announce Type: replace

Recent advances in large language model (LLM) reasoning have shown that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). Despite these successes, RL in its current form remains insufficient to induce capabilities that exceed the limitations of the base model, because it primarily optimizes over the model's existing knowledge rather than facilitating the acquisition of new information.