Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance

arXiv:2512.23461v2 Announce Type: replace-cross Reward models (RMs) are essential in reinforcement learning from human feedback (RLHF) to align large language models (LLMs) with human values. However, RM