Optimal Transport for LLM Reward Modeling from Noisy Preference

arXiv:2605.06036v1. Reward models are fundamental to Reinforcement Learning from Human Feedback (RLHF), yet real-world datasets are inevitably corrupted by noisy preferences. Conventional