Improved Bounds for Reward-Agnostic and Reward-Free Exploration

arXiv:2602.16363v2 Announce Type: replace We study reward-free and reward-agnostic exploration in episodic finite-horizon Markov decision processes (MDPs), where an agent explores an unknown environment without observing external rewards. Reward-free exploration aims to enable $\epsilon$-optimal policies for any reward revealed after exploration, while reward-agnostic exploration targets $\epsilon$-optimality for rewards drawn from a small finite class.