AI RESEARCH
MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization
arXiv cs.LG
arXiv:2605.10784v1 Announce Type: new Multi-negative preference optimization under the Plackett-Luce (PL) model extends Direct Preference Optimization (DPO) by leveraging comparative signals across one preferred and multiple rejected responses. However, optimizing over large negative pools is costly, and many candidates contribute redundant gradients because they affect policy updates in similar ways. We
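
For readers unfamiliar with the multi-negative setup, the following is a minimal sketch of one common way to write a PL-style loss with one preferred and several rejected responses: the negative log-probability that the preferred response ranks first. It is an illustration under stated assumptions, not necessarily the paper's exact objective. The function name, argument names, and the default beta are hypothetical; each "log-ratio" is assumed to be log pi(y|x) - log pi_ref(y|x) for the policy and a reference model.

```python
import math


def multi_negative_dpo_loss(chosen_logratio, rejected_logratios, beta=0.1):
    """Top-1 Plackett-Luce loss for one preferred and K rejected responses.

    Hypothetical helper for illustration only.
    chosen_logratio:    log pi(y_w|x) - log pi_ref(y_w|x) for the preferred response.
    rejected_logratios: list of the same quantity for each rejected response.
    beta:               assumed scaling of the implicit reward.
    """
    # Implicit rewards, with the preferred response at index 0.
    scores = [beta * chosen_logratio] + [beta * r for r in rejected_logratios]

    # Numerically stable log-sum-exp over all candidates.
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))

    # -log softmax(scores)[0]: penalizes any rejected response that
    # scores close to or above the preferred one.
    return log_norm - scores[0]


if __name__ == "__main__":
    # Three rejected responses with similar log-ratios.
    print(multi_negative_dpo_loss(1.2, [0.3, 0.31, 0.29], beta=0.5))
```

With a single rejected response this expression reduces to the familiar pairwise DPO loss, -log sigmoid(beta * (chosen - rejected)). The cost of the sum grows with the number of rejected responses, and near-identical negatives contribute nearly identical terms, which is the redundancy the abstract describes.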