GroupDPO: Memory efficient Group-wise Direct Preference Optimization

arXiv:2604.15602v1 Announce Type: new Preference optimization is widely used to align Large Language Models (LLMs) with preference feedback. However, most existing methods train on a single positive-negative pair per prompt, discarding the additional supervision available in preference datasets, which typically contain multiple candidate responses.
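
To make the single-pair setup concrete, here is a minimal sketch of the standard pairwise DPO loss, which is the kind of objective the abstract describes existing methods as using. It is not the GroupDPO objective. The function names, the `beta` value, and the log-probability numbers are illustrative assumptions, and `count_unused_pairs` is a hypothetical helper that simply counts how many positive-negative pairs go unused when a prompt has several candidates.

```python
import math


def dpo_pair_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO loss for ONE (positive, negative) pair.

    Inputs are sequence log-probabilities under the policy (logp_*) and
    under the frozen reference model (ref_logp_*). beta=0.1 is an
    assumed, illustrative value.
    """
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def count_unused_pairs(num_pos, num_neg):
    """Hypothetical helper: positive-negative pairs left unused when only
    one pair per prompt is trained on."""
    return num_pos * num_neg - 1


# Illustrative numbers only: one prompt with 2 positive and 3 negative candidates.
loss = dpo_pair_loss(logp_pos=-1.0, logp_neg=-2.0,
                     ref_logp_pos=-1.5, ref_logp_neg=-1.5)
unused = count_unused_pairs(num_pos=2, num_neg=3)
print(f"single-pair loss: {loss:.4f}, unused pairs: {unused}")
```

With 2 positive and 3 negative candidates there are 6 possible positive-negative pairs, so training on one of them leaves 5 unused, which is the discarded supervision the abstract refers to.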