Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation

arXiv:2603.02667v2 Announce Type: replace-cross Unifying text-image contrastive learning and text-to-image (T2I) generation in a single end-to-end model is challenging because the two objectives demand opposing masking regimes: contrastive alignment needs nearly all tokens to remain visible, while masked generative modeling needs heavy corruption. We