AI RESEARCH
HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation
arXiv cs.LG
arXiv:2603.23871v1. Large language models trained with reinforcement learning (RL) for mathematical reasoning face a fundamental challenge: on "cliff" prompts, problems the model cannot solve at all, the RL gradient vanishes entirely, so no learning signal reaches these failure modes. We