Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

arXiv:2601.18734v3 Announce Type: replace Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between