Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization
arXiv CS.AI
arXiv:2605.05040v1 Announce Type: cross
On-policy distillation is an efficient alternative to reinforcement learning, offering dense token-level