Calibration-Gated LLM Pseudo-Observations for Online Contextual Bandits

arXiv:2604.14961v1 Announce Type: cross

Contextual bandit algorithms suffer from high regret during cold start, when the learner has too little data to distinguish good arms from bad. We propose augmenting Disjoint LinUCB with LLM pseudo-observations: after each round, a large language model predicts counterfactual rewards for the unplayed arms, and these predictions are injected into the learner as weighted pseudo-observations.
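
The update described above can be sketched as a weighted ridge-regression step per arm. The following is a minimal illustration, not the authors' implementation: the names `DisjointLinUCBArm`, `step`, `predicted_rewards`, and `pseudo_weight` are hypothetical, the LLM is replaced by a plain list of predicted rewards, and the weight is a fixed constant because the abstract does not say here how the weights are chosen. The inverse of each arm's design matrix is maintained with the Sherman-Morrison identity so the sketch needs only the standard library.

```python
import math


class DisjointLinUCBArm:
    """One arm of Disjoint LinUCB: a ridge regression with weighted updates."""

    def __init__(self, dim, alpha=1.0):
        self.dim = dim
        self.alpha = alpha  # exploration coefficient (assumed value)
        # A starts as the identity matrix, so its inverse is the identity too.
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    def _matvec(self, x):
        return [sum(row[j] * x[j] for j in range(self.dim))
                for row in self.A_inv]

    def theta(self):
        # Ridge estimate: theta = A^{-1} b
        return self._matvec(self.b)

    def ucb(self, x):
        mean = sum(t * xi for t, xi in zip(self.theta(), x))
        Ax = self._matvec(x)
        var = sum(xi * a for xi, a in zip(x, Ax))
        return mean + self.alpha * math.sqrt(max(var, 0.0))

    def update(self, x, reward, weight=1.0):
        # Weighted update: A += w * x x^T and b += w * r * x.
        # A^{-1} is refreshed with the Sherman-Morrison identity.
        Ax = self._matvec(x)
        denom = 1.0 + weight * sum(xi * a for xi, a in zip(x, Ax))
        for i in range(self.dim):
            for j in range(self.dim):
                self.A_inv[i][j] -= weight * Ax[i] * Ax[j] / denom
        for i in range(self.dim):
            self.b[i] += weight * reward * x[i]


def step(arms, x, played, observed_reward, predicted_rewards, pseudo_weight):
    """Apply one round: a real update for the played arm, then weighted
    pseudo-observations for every unplayed arm.

    predicted_rewards[k] stands in for the LLM's counterfactual prediction
    for arm k (hypothetical interface); the entry for the played arm is ignored.
    """
    arms[played].update(x, observed_reward, weight=1.0)
    if pseudo_weight <= 0.0:
        return
    for k, arm in enumerate(arms):
        if k != played:
            arm.update(x, predicted_rewards[k], weight=pseudo_weight)
```

Setting `pseudo_weight` to zero recovers plain Disjoint LinUCB, which makes the effect of the injected predictions easy to isolate in an experiment.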