TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

arXiv:2605.10983v1 Announce Type: cross Reinforcement learning (RL) has shown extraordinary potential in aligning diffusion models to downstream tasks, yet most existing RL-based alignment methods still suffer from significant reward hacking, which degrades generative diversity and quality by inducing visual mode collapse and amplifying unreliable rewards. We identify the root cause as the mode-seeking nature of these methods, which maximize expected reward without effectively cons