AI RESEARCH

Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

arXiv cs.AI

arXiv:2605.05040v1 Announce Type: cross

On-policy distillation is an efficient alternative to reinforcement learning, offering dense token-level