GRPO and Reflection Reward for Mathematical Reasoning in Large Language Models

arXiv:2603.14041v1

The enhancement of reasoning capabilities in large language models (LLMs) has garnered significant attention, with supervised fine-tuning (SFT) and reinforcement learning emerging as the dominant paradigms. While recent studies recognize the importance of reflection in reasoning, existing methods seldom proactively encourage reflection during