AI RESEARCH

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

arXiv CS.LG

arXiv:2604.02288v1 Announce Type: new Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-