Beyond State-Wise Mirror Descent: Offline Policy Optimization with Parametric Policies

arXiv:2602.23811v3

We investigate the theoretical aspects of offline reinforcement learning (RL) under general function approximation. While prior works (e.g., Xie, 2021) have established the theoretical foundations of learning a good policy from offline data via pessimism, existing computationally tractable algorithms (often tractable in an oracle-efficient sense), such as PSPI, apply only to small, finite action spaces.
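
As a rough illustration of why such methods are tied to small, finite action spaces, the sketch below shows a generic state-wise mirror descent (multiplicative-weights) update on a tabular policy. It is not taken from the paper or from any PSPI implementation; the function name, the `q_values` critic estimates, and the step size `eta` are assumptions made for illustration.

```python
import math


def statewise_mirror_descent_step(policy, q_values, eta):
    """Apply one KL mirror descent update independently in every state.

    policy:   list of rows, one per state; each row is a distribution over actions
    q_values: list of rows with hypothetical critic estimates for each (state, action)
    eta:      step size (assumed hyperparameter)
    """
    new_policy = []
    for probs, qs in zip(policy, q_values):
        # Reweight each action by exp(eta * value), as in multiplicative weights.
        weights = [p * math.exp(eta * q) for p, q in zip(probs, qs)]
        # Normalizing requires a sum over every action in this state,
        # which is only practical when the action set is finite and small.
        total = sum(weights)
        new_policy.append([w / total for w in weights])
    return new_policy


# Toy example: 2 states, 3 actions, starting from the uniform policy.
policy = [[1.0 / 3.0] * 3 for _ in range(2)]
q_values = [[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
updated = statewise_mirror_descent_step(policy, q_values, eta=1.0)
```

The update enumerates and normalizes over all actions in each state, so its cost grows with the size of the action set and it has no direct analogue for continuous actions.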