HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation

arXiv:2603.23871v1 Announce Type: new Large language models trained with reinforcement learning (RL) for mathematical reasoning face a fundamental challenge: on problems the model cannot solve at all ("cliff" prompts), the RL gradient vanishes entirely, so no learning signal reaches these failure modes. We