On the Optimal Sample Complexity of Offline Multi-Armed Bandits with KL Regularization

arXiv:2605.02141v1

Kullback-Leibler (KL) regularization is widely used in offline decision-making and offers several benefits, motivating recent work on the sample complexity of offline learning with respect to KL-regularized performance metrics. Nevertheless, the exact sample complexity of KL-regularized offline learning remains far from fully characterized. In this paper, we study this question in the setting of multi-armed bandits (MABs