Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

arXiv:2605.14530v1 Announce Type: new Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes.