Bounded Ratio Reinforcement Learning
arXiv cs.LG
arXiv:2604.18578v1 Announce Type: new Proximal Policy Optimization (PPO) has become the predominant algorithm for on-policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect between the underlying foundations of trust region methods and the heuristic clipped objective used in PPO. In this paper, we bridge this gap by