Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

arXiv:2603.06577v1 Announce Type: new While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly rely on a conventional autoregressive architecture as their backbone, leaving significant room to explore more effective and efficient architectural alternatives. Concurrently, recent studies have successfully applied discrete diffusion models to domains such as visual understanding and image generation, revealing their considerable potential as a backbone for multimodal systems.