Proximal Point Nash Learning from Human Feedback

arXiv:2505.19731v2 Announce Type: replace-cross

Traditional Reinforcement Learning from Human Feedback (RLHF) often relies on reward models, frequently assuming preference structures like the Bradley-Terry model, which may not accurately capture the complexities of real human preferences (e.g., intransitivity). Nash Learning from Human Feedback (NLHF) offers a direct alternative by framing the problem as finding a Nash equilibrium of a game defined by these preferences.
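
To make the game framing concrete, the following is a minimal sketch, not the method of the paper. It assumes a hypothetical three-response preference matrix `P` that is cyclic (A beats B, B beats C, C beats A), so no Bradley-Terry scores can reproduce it: they would require r_A > r_B > r_C > r_A. The helper names `worst_case_win_rate` and `hedge_self_play` are illustrative, and the solver is plain multiplicative-weights (Hedge) self-play with iterate averaging, swapped in here only because it is short.

```python
import numpy as np

# Hypothetical preference matrix (an assumption for illustration):
# P[i, j] = probability that response i is preferred to response j.
# The preferences are cyclic, hence intransitive.
P = np.array([
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
])


def worst_case_win_rate(pi, P):
    """Lowest win probability of the mixed policy pi against any single response."""
    return float(np.min(pi @ P))


def hedge_self_play(P, steps=20000, eta=0.05):
    """Multiplicative-weights self-play; returns the time-averaged policy.

    In a symmetric zero-sum game the averaged iterates approach a Nash
    equilibrium, even though the last iterate need not converge.
    """
    n = P.shape[0]
    pi = np.arange(1, n + 1, dtype=float)
    pi /= pi.sum()  # deliberately non-uniform starting policy
    avg = np.zeros(n)
    for _ in range(steps):
        avg += pi
        # (P @ pi)[i] is the win rate of response i against the current policy.
        pi = pi * np.exp(eta * (P @ pi))
        pi /= pi.sum()
    return avg / steps


pure_a = np.array([1.0, 0.0, 0.0])
uniform = np.full(3, 1.0 / 3.0)
averaged = hedge_self_play(P)

print("always answer A:", worst_case_win_rate(pure_a, P))
print("uniform mixture:", worst_case_win_rate(uniform, P))
print("Hedge average:  ", worst_case_win_rate(averaged, P))
```

In this toy game every deterministic policy can be beaten 90% of the time by some response, while the uniform mixture wins at least half the time against any opponent, which is the defining property of the Nash equilibrium that NLHF targets.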