Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards

arXiv:2506.16658v2 Announce Type: replace-cross Multi-armed bandit (MAB) is a widely adopted framework for sequential decision-making under uncertainty. Traditional bandit algorithms rely solely on online data, which tends to be scarce because it can only be gathered during the online phase, while the arms are actively being pulled. However, in many practical settings, rich auxiliary data, such as covariates of past users, is available before any arm is deployed. We