AI RESEARCH

MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate

arXiv cs.LG

arXiv:2605.01347v1 Announce Type: cross

On-policy distillation (OPD) trains a student on its own trajectories under token-level teacher supervision, but existing methods are capped by a single-teacher capability ceiling: when the teacher errs, the student inherits the error. OPD also remains largely unexplored in agentic tasks, where per-step errors compound across long trajectories and destabilize