Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

arXiv:2604.08527v1

On-policy distillation (OPD) trains student models under their own induced distribution while leveraging supervision from stronger teachers. We identify a failure mode of OPD: as