AI RESEARCH
Gradient Extrapolation-Based Policy Optimization
arXiv CS.AI
arXiv:2605.06755v1 Announce Type: cross Reinforcement learning is widely used to improve the reasoning ability of large language models, especially when answers can be automatically checked. Standard GRPO-style