Landscape of Policy Optimization for Finite Horizon MDPs with General State and Action

arXiv:2409.17138v2 Announce Type: replace-cross Policy gradient methods are widely used in reinforcement learning. Yet the nonconvexity of policy optimization makes it difficult to understand the global convergence of these methods. For a class of finite-horizon Markov Decision Processes (MDPs) with general state and action spaces, we identify a set of structural properties that establish a benign nonconvex landscape for the policy optimization problem, namely the Polyak-Łojasiewicz-Kurdyka (PŁK) condition.
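To make "benign nonconvex landscape" concrete, the following is a minimal sketch of the classical Polyak-Łojasiewicz inequality, (1/2)|f'(x)|^2 >= mu (f(x) - f*), on a one-dimensional toy function. It is not the paper's MDP setting, nor its more general PŁK condition. The function f(x) = x^2 + 3 sin^2(x), the constant mu = 1/32, the grid, and all helper names are assumptions chosen for illustration; the sketch is written for minimization, and the inequality is only checked numerically on that grid.

```python
import math

# Illustrative constants (assumptions of this sketch, not taken from the abstract).
MU = 1.0 / 32.0    # candidate PL constant for the toy function below
L_SMOOTH = 8.0     # |f''(x)| = |2 + 6 cos(2x)| <= 8, so f' is 8-Lipschitz
F_STAR = 0.0       # global minimum value, attained at x = 0


def f(x):
    """Toy nonconvex objective: x^2 + 3 sin^2(x)."""
    return x * x + 3.0 * math.sin(x) ** 2


def grad_f(x):
    """Derivative of f: 2x + 3 sin(2x)."""
    return 2.0 * x + 3.0 * math.sin(2.0 * x)


def second_derivative(x):
    """Second derivative of f: 2 + 6 cos(2x); negative somewhere, so f is nonconvex."""
    return 2.0 + 6.0 * math.cos(2.0 * x)


def pl_holds(x, mu=MU):
    """Check the PL inequality 0.5 * f'(x)^2 >= mu * (f(x) - f*) at one point."""
    return 0.5 * grad_f(x) ** 2 >= mu * (f(x) - F_STAR) - 1e-12


def gradient_descent(x0, steps=50, lr=1.0 / L_SMOOTH):
    """Plain gradient descent with step size 1/L."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


# Numerical check on a grid over [-10, 10] (a spot check, not a proof).
grid = [i / 100.0 for i in range(-1000, 1001)]
all_pl = all(pl_holds(x) for x in grid)
is_nonconvex = any(second_derivative(x) < 0.0 for x in grid)

# Despite the nonconvexity, gradient descent reaches the global minimizer.
x_final = gradient_descent(3.0)
```

Here `is_nonconvex` confirms that the curvature changes sign, while `all_pl` checks that the gradient still dominates the suboptimality gap at every grid point. That domination is what rules out spurious stationary points and lets plain gradient descent reach the global minimizer from `x0 = 3.0`.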