From Weighting to Modeling: A Nonparametric Estimator for Off-Policy Evaluation

ArXi:2603.09436v1 Announce Type: new We study off-policy evaluation in the setting of contextual bandits, where we aim to evaluate a new policy using historical data that consists of contexts, actions and received rewards. This historical data typically does not faithfully represent action distribution of the new policy accurately. A common approach, inverse probability weighting (IPW), adjusts for these discrepancies in action distributions. However, this method often suffers from high variance due to the probability being in the denominator.