Contextual Online Uncertainty-Aware Preference Learning for Human Feedback

arXiv:2504.19342v3 Announce Type: replace-cross Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm in artificial intelligence for aligning large models with human preferences. In this paper, we propose a novel statistical framework that simultaneously conducts online decision-making and statistical inference on the optimal model, using human preference data based on dynamic contextual information. Our approach