Adaptive Negative Reinforcement for LLM Reasoning:Dynamically Balancing Correction and Diversity in RLVR

ArXi:2605.07137v1 Announce Type: cross Reinforcement learning with verifiable rewards (RLVR) has become a highly effective method for improving the reasoning abilities of Large Language Models (LLMs). Recent research shows that Negative Sample Reinforcement (NSR) -- which focuses on penalizing incorrect steps rather than simply rewarding correct ones -- can match or even exceed the performance of complex frameworks like PPO and GRPO across the entire Pass spectrum. However, current NSR techniques usually apply a fixed penalty throughout the