Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process

arXiv:2511.01718v2 Announce Type: replace-cross Vision-language-action (VLA) models aim to understand natural language instructions and visual observations and to execute corresponding actions as an embodied agent. Recent work integrates future images into the understanding-acting loop, yielding unified VLAs that jointly understand, generate, and act -- reading text and images and producing future images and actions.