SiMPO: Measure Matching for Online Diffusion Reinforcement Learning

arXiv:2603.10250v1 Announce Type: new A commonly used family of RL algorithms for diffusion policies applies softmax reweighting to the behavior policy, which usually induces an over-greedy policy and fails to leverage feedback from negative samples. In this work, we