Towards General Preference Alignment: Diffusion Models at Nash Equilibrium

arXiv:2605.04494v1 Announce Type: new Reinforcement learning from human feedback (RLHF) has become a popular approach for aligning text-to-image (T2I) diffusion models with human preferences. As a mainstream branch of RLHF, Direct Preference Optimization (DPO) offers a computationally efficient alternative that avoids explicit reward modeling, and it has been widely adopted in diffusion alignment.