Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification
arXiv cs.LG
arXiv:2601.21244v3 Announce Type: replace
Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success and unstable
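The excerpt does not say which RL objective is used, so the following is only a hypothetical sketch of why a small rollout budget starves learning on hard prompts. It assumes a binary verifier reward, k independent rollouts per prompt with per-rollout success probability p, and a GRPO-style group-normalized advantage. None of these are confirmed by the abstract, and the function names are illustrative. If every rollout in a group fails, all rewards are equal and every advantage is zero, so that prompt contributes no gradient signal.

```python
def prob_any_success(p: float, k: int) -> float:
    """Probability that at least one of k independent rollouts passes the verifier.

    Assumes each rollout succeeds independently with probability p (an illustration,
    not a claim about the paper's setup).
    """
    return 1.0 - (1.0 - p) ** k


def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: (reward - group mean) / group std.

    This GRPO-style normalization is an assumed example of an RLVR objective.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0.0:
        # Every rollout received the same reward (e.g., all failed the verifier),
        # so the group yields no learning signal for this prompt.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    # Hypothetical hard prompt: 2% per-rollout success, budget of 8 rollouts.
    print(f"P(at least one success) = {prob_any_success(0.02, 8):.3f}")
    print(group_advantages([0.0] * 8))                       # all-fail group: zero signal
    print(group_advantages([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
```

Under these assumptions, a prompt with p = 0.02 and k = 8 produces at least one verified success only about 15% of the time. In the remaining cases the whole group has identical rewards and is wasted, which is one way to read the abstract's "low sampling success" under limited rollout budgets.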