RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

arXiv:2605.11151v1 Announce Type: new

Offline-to-online reinforcement learning (RL) improves sample efficiency by leveraging pre-collected datasets prior to online interaction. A key challenge, however, is learning an accurate critic in large state-action spaces with limited dataset coverage. To mitigate harmful updates from value overestimation, prior methods impose pessimism by down-weighting out-of-distribution (OOD) actions relative to dataset actions.
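To make the pessimism used by prior methods concrete, the following is a minimal sketch of one common form of it: a conservative regularizer in the style of Conservative Q-Learning (CQL). The abstract does not name a specific prior method, so this choice is an assumption made for illustration, and it is not the RankQ objective. The function name `conservative_penalty`, its arguments, and the discrete action space are all hypothetical.

```python
import numpy as np


def conservative_penalty(q_values, dataset_actions):
    """CQL-style pessimism term (illustrative assumption, not RankQ).

    q_values:        (batch, n_actions) critic outputs for every action.
    dataset_actions: (batch,) integer indices of the actions in the dataset.

    Returns the batch mean of logsumexp_a Q(s, a) - Q(s, a_data). Adding this
    to the critic loss lowers Q for all actions (a soft maximum) while raising
    Q for the dataset action.
    """
    q_values = np.asarray(q_values, dtype=float)
    dataset_actions = np.asarray(dataset_actions, dtype=int)

    # Numerically stable log-sum-exp over the action dimension.
    q_max = q_values.max(axis=1, keepdims=True)
    lse = q_max[:, 0] + np.log(np.exp(q_values - q_max).sum(axis=1))

    # Critic value of the action actually taken in the dataset.
    q_data = q_values[np.arange(len(dataset_actions)), dataset_actions]

    return float((lse - q_data).mean())


# An overestimated OOD action (index 1) yields a larger penalty than a
# critic that already ranks the dataset action (index 0) highest.
overestimated = conservative_penalty([[1.0, 5.0]], [0])
well_ranked = conservative_penalty([[1.0, 0.0]], [0])
```

The term is always non-negative, because the log-sum-exp over all actions is at least as large as the value of any single action, and it grows as the critic assigns high values to actions outside the dataset.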