Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs

arXiv:2605.19768v1 Announce Type: new We study reinforcement learning for episodic Markov Decision Processes (MDPs) whose transitions are modelled by a multinomial logistic (MNL) model. Existing algorithms for MNL mixture MDPs yield a regret of $\smash{\tilde{O}(dH^2\sqrt{T})}$ (Li, 2024), where $d$ is the feature dimension, $H$ the episode length, and $T$ the number of episodes. Inspired by the logistic bandit literature (Abeille, 2021; Faury, 2022; Boudart, 2026), we