Offline Estimation of Controlled Markov Chains: Minimaxity and Sample Complexity

arXiv:2211.07092v5 Announce Type: replace-cross In this work, we study a natural nonparametric estimator of the transition probability matrices of a finite controlled Markov chain. We consider an offline setting with a fixed dataset, collected using a so-called logging policy. We develop sample complexity bounds for the estimator and establish conditions for minimaxity. Our statistical bounds depend on the logging policy through its mixing properties.
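
The abstract does not spell out the estimator. As an illustration only, the sketch below assumes it is the empirical count-based estimate, in which the probability of moving to state s' from state s under action a is the number of observed (s, a, s') transitions divided by the number of visits to (s, a). The function name `estimate_transitions`, the single-trajectory input format, and the uniform fallback for unvisited state-action pairs are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def estimate_transitions(states, actions, n_states, n_actions):
    """Count-based estimate of P(s' | s, a) from one logged trajectory.

    states  : sequence of length T + 1 with integer states s_0, ..., s_T
    actions : sequence of length T with integer actions a_0, ..., a_{T-1}
    Returns (p_hat, visits): p_hat has shape (n_states, n_actions, n_states),
    visits has shape (n_states, n_actions).
    """
    states = np.asarray(states)
    actions = np.asarray(actions)
    counts = np.zeros((n_states, n_actions, n_states))
    # Accumulate one count per observed transition (s_t, a_t) -> s_{t+1}.
    np.add.at(counts, (states[:-1], actions, states[1:]), 1)
    visits = counts.sum(axis=2, keepdims=True)
    # Assumption: rows for unvisited (s, a) pairs fall back to uniform.
    uniform = np.full_like(counts, 1.0 / n_states)
    p_hat = np.divide(counts, visits, out=uniform, where=visits > 0)
    return p_hat, visits.squeeze(axis=2)

# Toy offline dataset: 2 states, 2 actions, 4 logged transitions.
p_hat, visits = estimate_transitions([0, 1, 0, 1, 1], [0, 0, 0, 1], 2, 2)
```

The returned visit counts make the role of the logging policy visible: a state-action pair that the logging policy rarely reaches has few visits, so its estimated row rests on little data.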