Optimal Single-Policy Sample Complexity and Transient Coverage for Average-Reward Offline RL

arXiv:2506.20904v2

We study offline reinforcement learning in average-reward MDPs, a setting that poses additional challenges due to distribution shift and non-uniform coverage and that remains relatively underexamined theoretically. While previous work obtains performance guarantees under single-policy data coverage assumptions, those guarantees rely on additional complexity measures that are uniform over all policies, such as the uniform mixing time.