AI RESEARCH
A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models
arXiv CS.CL
arXiv:2506.18485v3 Announce Type: replace Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful learn-to-reason paradigm for large reasoning models to tackle complex tasks. However, the current RLVR paradigm remains inefficient because it works by trial and error. To improve, the model must explore the reward space by generating numerous responses and learn from fragmented reward signals, blind to the overall reward patterns.