Optimal Posterior Sampling for Policy Identification in Tabular Markov Decision Processes

arXiv:2605.03921v1 Announce Type: new We study the $(\varepsilon, \delta)$-PAC policy identification problem in finite-horizon episodic Markov Decision Processes. Existing approaches provide finite-time guarantees in the approximate setting ($\varepsilon>0$) but are computationally expensive, which makes them hard to implement, and they have a suboptimal dependence on $\log(1/\delta)$. We propose a randomized and computationally efficient algorithm for best policy identification that combines posterior sampling with an online learning algorithm to guide exploration in the.
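
For readers unfamiliar with posterior sampling in the tabular setting, the following is a minimal generic sketch, not the algorithm proposed in the paper. It assumes a Dirichlet posterior over transitions built from visit counts, known rewards, and plain backward induction for planning. The names `sample_transitions` and `backward_induction`, the array shapes, and the `prior` parameter are all illustrative assumptions. The online learning component that guides exploration is not shown, since the abstract does not specify it.

```python
import numpy as np


def sample_transitions(counts, prior=1.0, rng=None):
    """Draw one transition model from independent Dirichlet posteriors.

    counts: array of shape (H, S, A, S) with observed visit counts
            (step, state, action, next state). Hypothetical layout.
    prior:  symmetric Dirichlet prior parameter (an assumption).
    """
    rng = np.random.default_rng() if rng is None else rng
    alpha = counts + prior
    # Normalized independent Gamma draws are Dirichlet-distributed.
    g = rng.gamma(shape=alpha)
    return g / g.sum(axis=-1, keepdims=True)


def backward_induction(P, R):
    """Plan in a finite-horizon MDP with known model.

    P: transitions, shape (H, S, A, S); R: rewards, shape (H, S, A).
    Returns a greedy policy of shape (H, S) and values of shape (H + 1, S).
    """
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))
    policy = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        # Q[s, a] = R[h, s, a] + sum_s' P[h, s, a, s'] * V[h + 1, s']
        Q = R[h] + P[h] @ V[h + 1]
        policy[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return policy, V
```

In a loop of this kind, each episode would draw a model with `sample_transitions`, plan with `backward_induction`, run the resulting policy, and update `counts`. How and when to stop so that the returned policy is $\varepsilon$-optimal with probability at least $1-\delta$ is exactly what a PAC identification algorithm must add on top of this skeleton.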