Influencing Humans to Conform to Preference Models for RLHF

arXiv:2501.06416v3

Designing a reinforcement learning from human feedback (RLHF) algorithm to approximate a human's unobservable reward function requires assuming, implicitly or explicitly, a model of human preferences. An algorithm whose preference model poorly describes how humans generate preferences risks learning a poor approximation of the human's reward function. In this paper, we conduct three human studies to assess whether one can influence the expression of real human preferences to closely conform to a desired preference model.