AI RESEARCH
Optimal Transport for LLM Reward Modeling from Noisy Preference
arXiv cs.LG
arXiv:2605.06036v1 Announce Type: new Reward models are fundamental to Reinforcement Learning from Human Feedback (RLHF), yet real-world datasets are inevitably corrupted by noisy preferences. Conventional