Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification

arXiv:2601.21244v3 Announce Type: replace Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but it remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success and unstable