Exploitation Over Exploration: Unmasking the Bias in Linear Bandit Recommender Offline Evaluation

arXiv:2507.18756v2

Multi-Armed Bandit (MAB) algorithms are widely used in recommender systems that require continuous, incremental learning. A core aspect of MABs is the exploration-exploitation trade-off: choosing between exploiting items the user is likely to enjoy and exploring new ones to gather information. In contextual linear bandits, this trade-off is particularly central, as many variants share the same linear regression backbone and differ primarily in their exploration strategies.
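To make the "shared backbone, different exploration" point concrete, the following is a minimal sketch and not the paper's implementation. It assumes a ridge-regression estimator updated with the Sherman-Morrison formula, and three illustrative selection rules (greedy, epsilon-greedy, and a LinUCB-style upper confidence bound). All names (`RidgeBackbone`, `greedy`, `epsilon_greedy`, `lin_ucb`, `lam`, `alpha`, `epsilon`) are hypothetical and chosen for this example only.

```python
import math
import random


class RidgeBackbone:
    """Shared estimator (assumed form): theta = A^{-1} b, A = lam*I + sum of x x^T."""

    def __init__(self, dim, lam=1.0):
        self.dim = dim
        # Store A^{-1} directly; it starts as (lam * I)^{-1}.
        self.A_inv = [[(1.0 / lam if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    def _matvec(self, v):
        return [sum(row[j] * v[j] for j in range(self.dim)) for row in self.A_inv]

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of A^{-1} after observing (x, reward).
        Ax = self._matvec(x)
        denom = 1.0 + sum(x[i] * Ax[i] for i in range(self.dim))
        for i in range(self.dim):
            for j in range(self.dim):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
        self.b = [self.b[i] + reward * x[i] for i in range(self.dim)]

    def mean(self, x):
        # Predicted reward theta^T x.
        theta = self._matvec(self.b)
        return sum(theta[i] * x[i] for i in range(self.dim))

    def width(self, x):
        # Uncertainty term sqrt(x^T A^{-1} x).
        Ax = self._matvec(x)
        return math.sqrt(max(sum(x[i] * Ax[i] for i in range(self.dim)), 0.0))


def greedy(model, arms, rng):
    # Pure exploitation: pick the arm with the highest predicted reward.
    return max(range(len(arms)), key=lambda a: model.mean(arms[a]))


def epsilon_greedy(model, arms, rng, epsilon=0.1):
    # With probability epsilon pick a uniformly random arm, otherwise exploit.
    if rng.random() < epsilon:
        return rng.randrange(len(arms))
    return greedy(model, arms, rng)


def lin_ucb(model, arms, rng, alpha=1.0):
    # Optimism: predicted reward plus alpha times the uncertainty width.
    return max(range(len(arms)),
               key=lambda a: model.mean(arms[a]) + alpha * model.width(arms[a]))


# Usage: one observation on arm 0, then compare the selection rules.
arms = [[1.0, 0.0], [0.0, 1.0]]
model = RidgeBackbone(dim=2, lam=1.0)
model.update(arms[0], reward=1.0)
rng = random.Random(0)
choice_greedy = greedy(model, arms, rng)            # 0: the only arm with evidence
choice_ucb = lin_ucb(model, arms, rng, alpha=2.0)   # 1: the unexplored arm
```

In this sketch the three rules read the same estimator state and differ only in how they turn it into a choice, which is the sense in which such variants differ primarily in exploration.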