Scaling Reasoning Efficiently via Relaxed On-Policy Distillation

arXiv:2603.11137v1 Announce Type: new On-policy distillation is pivotal for transferring reasoning capabilities to capacity-constrained models, yet it remains prone to instability and negative transfer. We show that on-policy distillation can be interpreted, both theoretically and empirically, as a form of policy optimization, where the teacher-student log-likelihood ratio acts as a token reward. From this insight, we