ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM Alignment

arXiv:2603.23184v1 Announce Type: cross Reward modeling is a long-standing challenge in reinforcement learning from human feedback (RLHF) for aligning language models. Current reward modeling depends heavily on experimental feedback data that is costly to collect. In this work, we study \textit{implicit reward modeling} -- learning reward models from implicit human feedback (e.g., clicks and copies) -- as a cost-effective alternative.