Reward Models Are Secretly Value Functions: Temporally Coherent Reward Modeling

arXiv:2604.22981v1

Reward models in RLHF are trained to score only the final token of a response, a choice that discards the signal available at every intermediate position and yields models whose token-level outputs are noise. We argue that this is a missed opportunity: a well-trained reward model's output at any token should represent the conditional expectation of the final reward given the response so far. We