CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks

arXiv:2603.18736v1 Announce Type: cross Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models, current reward modeling relies heavily on experimental feedback data collected from human annotators under controlled and costly conditions. In this work, we