LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

arXiv:2604.20796v1 Announce Type: new We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images.
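
The two mechanisms named in the abstract, vector-quantized discretization and block-level masked diffusion, can be illustrated with a toy sketch. This is not the LLaDA2.0-Uni implementation: the abstract does not give the codebook, the unmasking schedule, or the model interface, so `quantize`, `toy_denoiser`, `blockwise_unmask`, the `MASK` id, and the confidence-ranked reveal rule are all hypothetical stand-ins chosen for illustration. The "denoiser" here returns random guesses; a real backbone would predict tokens conditioned on the unmasked context.

```python
import random

MASK = -1  # hypothetical mask token id, chosen for this sketch only


def quantize(features, codebook):
    """Map each continuous feature vector to the index of its nearest codebook entry."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [
        min(range(len(codebook)), key=lambda k: sqdist(f, codebook[k]))
        for f in features
    ]


def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for the backbone: one random (token, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def blockwise_unmask(length, block_size, steps_per_block, vocab_size, seed=0):
    """Fill a fully masked sequence block by block, left to right.

    Within each block, every step reveals the most confident masked positions;
    the last step of a block reveals whatever is still masked.
    Assumes steps_per_block >= 1.
    """
    rng = random.Random(seed)
    tokens = [MASK] * length
    for start in range(0, length, block_size):
        block = range(start, min(start + block_size, length))
        for step in range(steps_per_block):
            masked = [i for i in block if tokens[i] == MASK]
            if not masked:
                break
            guesses = toy_denoiser(tokens, vocab_size, rng)
            remaining_steps = steps_per_block - step
            n_reveal = -(-len(masked) // remaining_steps)  # ceiling division
            masked.sort(key=lambda i: guesses[i][1], reverse=True)
            for i in masked[:n_reveal]:
                tokens[i] = guesses[i][0]
    return tokens


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
    print(quantize([[0.1, -0.1], [0.9, 1.2], [4.0, 6.0]], codebook))
    print(blockwise_unmask(length=10, block_size=4, steps_per_block=3, vocab_size=16))
```

Because images become discrete indices after quantization, the same block-wise unmasking loop can in principle run over text tokens and visual tokens alike, which is the unification the abstract describes.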