AI RESEARCH

Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

arXiv cs.CL

arXiv:2605.13643v1 Announce Type: new On-policy distillation (OPD) trains a student model on its own rollouts using dense feedback from a stronger teacher. Prior literature suggests that, provided teacher feedback is available, supervising the full sequence of response tokens should monotonically improve performance. However, we demonstrate that this assumption sometimes fails to hold in strong-to-weak OPD settings.
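To make "dense feedback" over a student rollout concrete, the following is a minimal sketch, not the paper's implementation. It assumes the per-token signal is a reverse KL divergence between the student's and the teacher's next-token distributions, which is one common choice for OPD that the abstract does not confirm. The function names (`per_token_reverse_kl`, `opd_loss`) and the `weights` parameter are hypothetical, introduced only for illustration. Uniform weights correspond to supervising the full sequence of response tokens.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one position's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def per_token_reverse_kl(student_logits, teacher_logits):
    """Return KL(student || teacher) at each position of a student-sampled rollout.

    Both arguments are lists of per-position logit vectors, evaluated on the
    same token sequence (the student's own rollout).
    Assumption: reverse KL is the feedback signal; the source does not say so.
    """
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        p_s = softmax(s_row)
        p_t = softmax(t_row)
        kl = sum(
            ps * (math.log(ps) - math.log(pt))
            for ps, pt in zip(p_s, p_t)
            if ps > 0.0
        )
        losses.append(kl)
    return losses


def opd_loss(student_logits, teacher_logits, weights=None):
    """Weighted average of the per-token losses.

    weights=None supervises every response position equally (full-sequence
    supervision). The weights argument is a hypothetical hook for any
    position-dependent scheme; it is not taken from the paper.
    """
    per_token = per_token_reverse_kl(student_logits, teacher_logits)
    if weights is None:
        weights = [1.0] * len(per_token)
    total_weight = sum(weights)
    return sum(w * l for w, l in zip(weights, per_token)) / total_weight
```

In an actual training loop the student logits would carry gradients and the teacher logits would be treated as constants; this sketch only computes the scalar signal.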