Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs

arXiv:2603.23926v1 Announce Type: new Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with many algorithms suffering from high ``burn-in'' costs and failing to adapt to benign instance-specific complexity. In this work, we address these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the $\gamma$-regret.