MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models

arXiv:2604.16972v1 Announce Type: new Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach to improve the reasoning abilities of Large Language Models (LLMs). Among RLVR algorithms, Group Relative Policy Optimization (GRPO) and its variants have demonstrated strong performance and high