On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR

arXiv:2605.06523v1 Announce Type: new Extensive recent research has demonstrated that the enhanced reasoning capabilities that models acquire through Reinforcement Learning with Verifiable Rewards (RLVR) are primarily concentrated within the rank-1 components. Building on this observation, we employed Periodic Rank-1 Substitution and identified a counterintuitive phenomenon: RLVR may exhibit implicit reward overfitting to the