AI RESEARCH
Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing
arXiv CS.LG
arXiv:2604.02288v1 Announce Type: new Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-