Sharp asymptotic theory for Q-learning with LDTZ learning rate and its generalization

arXiv:2604.04218v1 Announce Type: cross Despite the sustained popularity of Q-learning as a practical tool for policy determination, most of the relevant theoretical literature deals with either constant ($\eta_{t} \equiv \eta$) or polynomially decaying ($\eta_{t} = \eta t^{-\alpha}$) learning-rate schedules. However, it is well known that these choices suffer from either persistent bias or prohibitively slow convergence.