Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching

arXiv:2605.06474v1

We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon Markov decision processes (MDPs). Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weights are learned inductively in a top-down manner via a moment matching objective against a value-function discriminator class.
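To make the idea of per-sample weights learned by moment matching concrete, the following is a minimal sketch for the simplest possible setting: a horizon-1 problem (a contextual bandit) with a linear discriminator class over one-hot state-action features. It is not the Q-MMR algorithm from the paper, and it omits the recursive step across time. Every name here (`one_hot_features`, `moment_matching_weights`, `reward_table`, `pi`), the uniform behavior policy, and the minimum-norm least-squares solve are assumptions chosen for illustration.

```python
import numpy as np


def one_hot_features(states, actions, n_states, n_actions):
    """Tabular features: one indicator per (state, action) pair."""
    phi = np.zeros((len(states), n_states * n_actions))
    phi[np.arange(len(states)), states * n_actions + actions] = 1.0
    return phi


def moment_matching_weights(phi_data, phi_target):
    """Minimum-norm weights w satisfying phi_data.T @ w = sum of target features.

    For every linear discriminator f = theta @ phi, this enforces
    sum_i w_i f(s_i, a_i) = sum_i E_{a ~ pi}[f(s_i, a)].
    """
    target_moment = phi_target.sum(axis=0)
    # lstsq returns the minimum-norm solution of an underdetermined system.
    w, *_ = np.linalg.lstsq(phi_data.T, target_moment, rcond=None)
    return w


# Hypothetical logged data from a uniform behavior policy.
rng = np.random.default_rng(0)
n, n_states, n_actions = 500, 3, 2
states = rng.integers(n_states, size=n)
actions = rng.integers(n_actions, size=n)
reward_table = np.array([[0.0, 1.0], [0.5, 0.2], [1.0, 0.3]])
rewards = reward_table[states, actions]

# Hypothetical target policy pi[s, a].
pi = np.array([[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]])

phi_data = one_hot_features(states, actions, n_states, n_actions)

# Expected features under the target policy at each logged state.
phi_target = np.zeros_like(phi_data)
for a in range(n_actions):
    phi_a = one_hot_features(states, np.full(n, a), n_states, n_actions)
    phi_target += pi[states, a][:, None] * phi_a

weights = moment_matching_weights(phi_data, phi_target)

# Reweighted rewards approximate the target policy's expected reward.
estimate = float(np.mean(weights * rewards))
print(estimate)
```

In this tabular case the matched moments pin the estimate down exactly whenever every state-action pair appears in the data, because the reward function itself lies in the discriminator class. With a richer or misspecified class the weights would only approximately balance the two sides.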