Tail Distribution of Regret in Optimistic Reinforcement Learning

arXiv:2511.18247v3 Announce Type: replace We derive instance-dependent tail bounds for the regret of optimism-based reinforcement learning in finite-horizon tabular Markov decision processes with unknown transition dynamics. We first study a UCBVI-type (model-based) algorithm and characterize the tail distribution of the cumulative regret $R_K$ over $K$ episodes via explicit bounds on $P(R_K \ge x)$, going beyond analyses limited to $E[R_K]$ or a single high-probability quantile.
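
To see why a bound on the whole tail carries more information than either summary, the following sketch may help. It assumes the regret is nonnegative, $R_K \ge 0$ (as with pseudo-regret defined through value gaps; this is an assumption here, not something the abstract states), and it uses $\delta \in (0,1)$ and $x_\delta$ as hypothetical notation for a confidence level and its threshold.

```latex
% Sketch under the assumption R_K >= 0; \delta and x_\delta are illustrative notation.
\begin{align*}
  % The expectation is recovered by integrating the tail function.
  E[R_K] &= \int_0^\infty P(R_K \ge x)\,dx, \\
  % A high-probability guarantee constrains the tail at one threshold only.
  P(R_K \ge x_\delta) &\le \delta .
\end{align*}
```

Under that assumption, a bound on $P(R_K \ge x)$ valid for all $x$ yields a bound on $E[R_K]$ by integration and a threshold $x_\delta$ for every $\delta$ at once, whereas neither summary alone determines the tail.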