Strongly-polynomial time and validation analysis of policy gradient methods

arXiv:2409.19437v5 Announce Type: replace-cross This paper proposes a novel termination criterion, termed the advantage gap function, for finite state and action Markov decision processes (MDPs) and reinforcement learning (RL). By incorporating this advantage gap function into the design of step size rules and deriving a new linear rate of convergence that is independent of the stationary state distribution of the optimal policy, we demonstrate that policy gradient methods can solve MDPs in strongly-polynomial time.
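
The abstract does not define the advantage gap function, so the following is only a minimal sketch under stated assumptions, not the paper's construction. It assumes a reward-maximizing, discounted, tabular MDP and takes the gap to be the largest advantage max_{s,a} [Q^pi(s,a) - V^pi(s)]. By the standard performance difference lemma, this quantity is zero exactly when the policy is optimal and bounds the suboptimality V*(s) - V^pi(s) by gap / (1 - gamma), which is what makes it usable as a computable stopping test. All names (`evaluate_policy`, `advantage_gap`, the toy arrays `P` and `R`) are hypothetical.

```python
import numpy as np


def evaluate_policy(P, R, policy, gamma):
    """Exact V and Q for a tabular policy.

    P: transitions, shape (S, A, S); R: rewards, shape (S, A);
    policy: action probabilities, shape (S, A). All assumed inputs.
    """
    n_states = P.shape[0]
    P_pi = np.einsum("sa,sat->st", policy, P)   # state-to-state kernel under the policy
    r_pi = np.einsum("sa,sa->s", policy, R)     # expected one-step reward under the policy
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    Q = R + gamma * (P @ V)                     # shape (S, A)
    return V, Q


def advantage_gap(P, R, policy, gamma):
    """Assumed gap: the largest advantage over all states and actions."""
    V, Q = evaluate_policy(P, R, policy, gamma)
    return float(np.max(Q - V[:, None]))


# Toy 2-state, 2-action deterministic MDP (illustrative data only).
gamma = 0.9
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0  # state 0, action 0: stay in state 0
P[0, 1, 1] = 1.0  # state 0, action 1: move to state 1
P[1, 0, 1] = 1.0  # state 1, action 0: stay in state 1
P[1, 1, 0] = 1.0  # state 1, action 1: move to state 0
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])  # only staying in state 1 pays a reward

uniform = np.full((2, 2), 0.5)             # suboptimal policy
optimal = np.array([[0.0, 1.0],
                    [1.0, 0.0]])           # go to state 1, then stay there

gap_uniform = advantage_gap(P, R, uniform, gamma)
gap_optimal = advantage_gap(P, R, optimal, gamma)
```

In this sketch a policy-improvement loop would stop once the gap falls below a tolerance `eps`, which certifies a value error of at most `eps / (1 - gamma)` in every state without needing to know the optimal policy or its stationary distribution.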