Factored Causal Representation Learning for Robust Reward Modeling in RLHF

arXiv:2601.21350v2

A reliable reward model is essential for aligning large language models with human preferences through reinforcement learning from human feedback (RLHF). However, standard reward models are susceptible to spurious features that are not causally related to human labels. This can lead to reward hacking, where a high predicted reward does not translate into better behavior.