AI RESEARCH
I implemented DPO from the paper and the reward margin hit 599 here's what that actually means [R]
r/MachineLearning
•
DPO (Rafailo, NeurIPS 2023) is supposed to be the clean alternative to PPO. No reward model in the