Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers

arXiv:2605.14270v1 Announce Type: new Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-to-image generation, yet they frequently suffer from concept omission, where specified objects or attributes fail to emerge in the generated image. By performing linear probing on text tokens, we demonstrate that text embeddings can distinguish a characteristic "omission signal" representing the absence of target concepts.
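
To make the idea of linear probing concrete, here is a minimal sketch of a logistic-regression probe. It is an illustration only, not the authors' setup: the abstract does not say which layers, tokens, or probe configuration were used. The embeddings below are synthetic stand-ins (Gaussian noise plus a planted offset direction), and every name (`make_synthetic_embeddings`, `train_linear_probe`, `probe_accuracy`) and hyperparameter is hypothetical.

```python
import math
import random


def make_synthetic_embeddings(n, dim, shift, seed=0):
    """Synthetic stand-in for text-token embeddings (assumption, not real MM-DiT data).

    Label 1 means "concept omitted", label 0 means "concept present".
    The two classes are offset in opposite directions along one fixed axis.
    """
    rng = random.Random(seed)
    direction = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
    features, labels = [], []
    for i in range(n):
        label = i % 2
        sign = 1.0 if label == 1 else -1.0
        features.append([rng.gauss(0.0, 1.0) + sign * shift * d for d in direction])
        labels.append(label)
    return features, labels


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(features, labels, lr=0.1, epochs=200):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, features, labels):
    """Fraction of examples whose label the linear probe predicts correctly."""
    correct = 0
    for x, y in zip(features, labels):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(labels)


# Train on one synthetic sample and evaluate on an independently drawn one.
train_x, train_y = make_synthetic_embeddings(n=200, dim=16, shift=0.5, seed=0)
test_x, test_y = make_synthetic_embeddings(n=100, dim=16, shift=0.5, seed=1)
w, b = train_linear_probe(train_x, train_y)
accuracy = probe_accuracy(w, b, test_x, test_y)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

If a probe this simple reaches high held-out accuracy, the two classes are approximately linearly separable in the embedding space, which is the sense in which a linear probe can be said to "distinguish" a signal. With real embeddings, the features would be extracted from the model's text tokens and the labels would come from checking whether the target concept appears in the generated image.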