AIPO: Learning to Reason from Active Interaction

arXiv:2605.08401v1 Announce Type: cross Recent advances in large language models (LLMs) have demonstrated remarkable reasoning capabilities, largely stimulated by Reinforcement Learning with Verifiable Rewards (RLVR). However, existing RL algorithms face a fundamental limitation: their exploration remains largely constrained by the inherent capability boundary of the policy model. Although recent methods