ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

arXiv:2605.00380v1 Announce Type: new Reinforcement Learning with Verifiable Rewards (RLVR) enhances the reasoning of Large Language Models (LLMs) but usually exhibits limited generation diversity due to the over-incentivization of positive rewards. Although methods like Negative Sample Reinforcement (NSR) mitigate this issue by upweighting the penalty from negative samples, they may suppress the semantic distributions shared between positive and negative responses.