A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies

arXiv:2510.16132v2

In this work, we present the first finite-time analysis of Q-learning with time-varying learning policies (i.e., on-policy sampling) for discounted Markov decision processes under minimal assumptions, requiring only the existence of a policy that induces an irreducible Markov chain over the state space. We establish a last-iterate convergence rate for $\mathbb{E}[\|Q_k - Q^*\|_\infty^2]$, implying a sample complexity of order $\mathcal{O}(1/\xi^2)$ for achieving $\mathbb{E}[\|Q_k - Q^*\|_\infty]\le \xi$.
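To make the setting concrete, the following is a minimal sketch of tabular Q-learning in which the behavior policy is derived from the current iterate $Q_k$ and therefore changes over time. It is an illustration under stated assumptions, not the algorithm or parameter choices analyzed in the paper: the epsilon-greedy behavior policy, the constant step size `alpha`, the randomly generated MDP, and all function names (`random_mdp`, `value_iteration`, `q_learning_on_policy`, `sup_error`) are hypothetical. The random MDP is built with strictly positive transition probabilities, so every policy induces an irreducible chain, which is stronger than the abstract's assumption.

```python
import random


def random_mdp(n_states, n_actions, seed=0):
    """Hypothetical test MDP: strictly positive transitions, rewards in [0, 1]."""
    rng = random.Random(seed)
    P, R = [], []
    for _ in range(n_states):
        P_s, R_s = [], []
        for _ in range(n_actions):
            w = [rng.random() + 0.1 for _ in range(n_states)]
            total = sum(w)
            P_s.append([x / total for x in w])  # P[s][a] is a distribution over next states
            R_s.append(rng.random())            # deterministic reward r(s, a)
        P.append(P_s)
        R.append(R_s)
    return P, R


def value_iteration(P, R, gamma, iters=2000):
    """Approximate Q* by repeated Bellman optimality backups (needs the model)."""
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        V = [max(row) for row in Q]
        Q = [[R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
              for a in range(n_actions)] for s in range(n_states)]
    return Q


def sup_error(Q, Q_star):
    """Sup-norm distance ||Q - Q*||_inf over all state-action pairs."""
    return max(abs(q - qs)
               for row, row_star in zip(Q, Q_star)
               for q, qs in zip(row, row_star))


def q_learning_on_policy(P, R, gamma, steps, eps=0.2, alpha=0.05, seed=0):
    """Q-learning along a single trajectory with a time-varying behavior policy."""
    rng = random.Random(seed)
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(steps):
        # Assumed behavior policy: epsilon-greedy with respect to the current Q,
        # so the sampling distribution changes whenever Q is updated.
        if rng.random() < eps:
            a = rng.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda b: Q[s][b])
        s_next = rng.choices(range(n_states), weights=P[s][a])[0]
        # Standard Q-learning update toward the sampled Bellman optimality target.
        target = R[s][a] + gamma * max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next
    return Q


if __name__ == "__main__":
    P, R = random_mdp(4, 2, seed=1)
    Q_star = value_iteration(P, R, gamma=0.9)
    Q = q_learning_on_policy(P, R, gamma=0.9, steps=50000, seed=1)
    print("sup-norm error:", sup_error(Q, Q_star))
```

The printed quantity is a single-run realization of $\|Q_k - Q^*\|_\infty$; estimating the expectation in the abstract's bound would require averaging over independent runs (different `seed` values).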