CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning

arXiv:2602.02979v2 Announce Type: replace Large Language Models (LLMs) have demonstrated strong potential in complex reasoning, yet their progress remains fundamentally constrained by reliance on massive high-quality human-curated tasks and labels, either through supervised fine-tuning (SFT) or reinforcement learning (RL) on reasoning-specific data. This dependence renders supervision-heavy