A Lyapunov Analysis of Softmax Policy Gradient for Stochastic Bandits

arXiv:2603.26547v1. We adapt Lattimore's analysis of policy gradient for continuous-time $k$-armed stochastic bandits to the standard discrete-time setup. As in continuous time, we prove that with learning rate $\eta = O(\Delta_{\min}^2/(\Delta_{\max} \log(n)))$ the regret is $O(k \log(k) \log(n) / \eta)$, where $n$ is the horizon and $\Delta_{\min}$ and $\Delta_{\max}$ are the minimum and maximum gaps.
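
For readers who want a concrete picture of the discrete-time algorithm, the following is a minimal sketch, not taken from the paper. It assumes the common REINFORCE-style softmax update without a baseline, $\theta_{t+1}(a) = \theta_t(a) + \eta X_t (\mathbf{1}\{A_t = a\} - \pi_t(a))$ with $\pi_t = \mathrm{softmax}(\theta_t)$, and Bernoulli rewards; the exact update analysed in the paper may differ. The names `softmax`, `softmax_policy_gradient`, `means`, and `seed` are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]


def softmax_policy_gradient(means, n, eta, seed=0):
    """Run softmax policy gradient on a Bernoulli bandit (illustrative sketch).

    means: assumed arm means in [0, 1]; n: horizon; eta: learning rate.
    Returns the final logits and the pseudo-regret (sum of gaps of played arms).
    """
    rng = random.Random(seed)
    k = len(means)
    theta = [0.0] * k  # uniform initial policy
    best = max(means)
    regret = 0.0
    for _ in range(n):
        pi = softmax(theta)
        # Sample an arm from the current policy.
        a = rng.choices(range(k), weights=pi)[0]
        # Bernoulli reward with mean means[a] (an assumption of this sketch).
        reward = 1.0 if rng.random() < means[a] else 0.0
        # Stochastic gradient step: reward * grad of log pi(a) w.r.t. theta.
        for b in range(k):
            indicator = 1.0 if b == a else 0.0
            theta[b] += eta * reward * (indicator - pi[b])
        regret += best - means[a]
    return theta, regret


if __name__ == "__main__":
    means = [0.9, 0.5]
    n = 1000
    gap = max(means) - min(means)  # with two arms, min and max gaps coincide
    # Constant factor 1 is an arbitrary choice; the abstract only gives an O(.) rate.
    eta = gap ** 2 / (gap * math.log(n))
    theta, regret = softmax_policy_gradient(means, n, eta)
    print("final logits:", theta)
    print("pseudo-regret:", regret)
```

The learning rate in the usage block mirrors the form $\Delta_{\min}^2/(\Delta_{\max} \log(n))$ from the abstract with the hidden constant set to 1, which is an assumption rather than a value the paper prescribes. Because each update adds a vector whose entries sum to zero, the logits keep the same sum throughout the run.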