The Unlearnability Phenomenon in RLVR for Language Models

arXiv:2605.16787v1 Announce Type: new Reinforcement Learning with Verifiable Reward (RLVR) has proven effective in improving the reasoning ability of Large Language Models (LLMs). However, the learning dynamics of RLVR remain underexplored. In this paper, we reveal a counterintuitive phenomenon: among hard examples that the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts are present. To understand the phenomenon, we first demonstrate that existing optimization and sampling techniques fail to resolve unlearnability.