PrAg-PO: Prompt Augmented Policy Optimization for Robust and Diverse Mathematical Reasoning

arXiv:2602.03190v3 Announce Type: replace-cross Reinforcement learning algorithms such as group-relative policy optimization (GRPO) have shown strong potential for improving the mathematical reasoning capabilities of large language models. While a growing body of work seeks to improve