Relative Entropy Pathwise Policy Optimization

arXiv:2507.11019v4 Announce Type: replace

Score-function-based methods for policy learning, such as REINFORCE and PPO, have delivered strong results in game-playing and robotics, yet their high variance often undermines