DDO-RM for LLM Preference Optimization: A Minimal Held-Out Benchmark against DPO

arXiv:2604.11119v1 Announce Type: cross This paper reorganizes the current manuscript around the DPO versus DDO-RM preference-optimization project and focuses on two parts: the algorithmic view and the preliminary held-out benchmark.