Transformers Learn the Optimal DDPM Denoiser for Multi-Token GMMs

arXiv:2604.10074v1 Announce Type: new Transformer-based diffusion models have demonstrated remarkable performance at generating high-quality samples. However, our theoretical understanding of the reasons for this success remains limited. For instance, existing models are typically trained by minimizing a denoising objective, which is equivalent to fitting the score function of the