Selector-Guided Autonomous Curriculum for One-Shot Reinforcement Learning from Verifiable Rewards

arXiv:2605.01823v1 Announce Type: new Reinforcement Learning from Verifiable Rewards (RLVR) has recently been established as a highly effective technique for improving the mathematical reasoning of Large Language Models (LLMs) using only a single training instance. Current state-of-the-art 1-shot RLVR methods select that instance heuristically, mostly by the historical variance of its rewards, which we find to be an inherently misleading measure of transferability.