CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization

arXiv:2605.08873v1 Announce Type: new Group Relative Policy Optimization (GRPO) has emerged as a powerful algorithm for improving the reasoning capabilities of language models, but it often fails to improve small models because rewards on difficult tasks are sparse. Existing work mitigates this issue by leveraging a larger model, either to provide hints for rollouts or to provide dense reward signals through knowledge distillation (KD). However, this approach assumes the existence of such an oracle, and