RePack then Refine: Efficient Diffusion Transformer with Vision Foundation Model

arXiv:2512.12083v3 Announce Type: replace Semantic-rich features from Vision Foundation Models (VFMs) have been leveraged to enhance Latent Diffusion Models (LDMs). However, raw VFM features are typically high-dimensional and redundant, increasing the difficulty of learning and reducing