AI RESEARCH

MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization

arXiv cs.LG

arXiv:2605.10784v1 Announce Type: new Multi-negative preference optimization under the Plackett-Luce (PL) model extends Direct Preference Optimization (DPO) by leveraging comparative signals between one preferred response and multiple rejected responses. However, optimizing over large negative pools is costly, and many candidates contribute redundant gradients because they affect policy updates in similar ways. We