Peng's Q($\lambda$) for Conservative Value Estimation in Offline Reinforcement Learning

arXiv:2605.14779v1 Announce Type: new We propose a model-free offline multi-step reinforcement learning (RL) algorithm, Conservative Peng's Q($\lambda$) (CPQL). Our algorithm adapts the Peng's Q($\lambda$) (PQL) operator for conservative value estimation as an alternative to the Bellman operator. To the best of our knowledge, this is the first work in offline RL to theoretically and empirically demonstrate the effectiveness of conservative value estimation with a \textit{multi-step} operator by fully leveraging offline trajectories.
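For readers unfamiliar with the multi-step operator, the following is a minimal sketch of how Peng's Q($\lambda$) targets are commonly computed along one stored trajectory, using the standard recursion G_t = r_t + gamma * [(1 - lambda) * max_a Q(s_{t+1}, a) + lambda * G_{t+1}]. It illustrates the generic PQL return only, not the CPQL algorithm itself: the conservative term is not specified in this abstract and is omitted. The function name, argument names, and default values are hypothetical, and `next_q_max` is assumed to hold precomputed greedy next-state values (in continuous-action settings this would typically be replaced by the value under the learned policy).

```python
import numpy as np

def peng_q_lambda_targets(rewards, next_q_max, dones, gamma=0.99, lam=0.7):
    """Backward recursion for Peng's Q(lambda) targets on one trajectory.

    rewards[t]    : reward observed at step t
    next_q_max[t] : assumed precomputed max_a Q(s_{t+1}, a)
    dones[t]      : True if step t ended the episode
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    next_q_max = np.asarray(next_q_max, dtype=np.float64)
    dones = np.asarray(dones, dtype=bool)

    T = len(rewards)
    targets = np.zeros(T)
    # Past the last stored step, fall back to the greedy bootstrap value,
    # so the final target reduces to a one-step target.
    next_return = next_q_max[-1]
    for t in reversed(range(T)):
        if dones[t]:
            # No bootstrapping beyond a terminal transition.
            targets[t] = rewards[t]
        else:
            # Mix the one-step greedy value with the multi-step return.
            mixed = (1.0 - lam) * next_q_max[t] + lam * next_return
            targets[t] = rewards[t] + gamma * mixed
        next_return = targets[t]
    return targets
```

With `lam=0` the targets reduce to one-step Bellman optimality targets, and with `lam=1` on a trajectory that terminates they reduce to discounted Monte Carlo returns, which is the sense in which the operator interpolates between the two.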