Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM

arXiv:2604.06832v1 Announce Type: new Vision-language models (VLMs) predominantly rely on autoregressive (AR) decoding, which generates tokens one at a time and fundamentally limits inference throughput. This limitation is especially acute in physical AI scenarios such as robotics and autonomous driving, where VLMs are deployed on edge devices at batch size one, making AR decoding memory-bandwidth-bound and leaving hardware parallelism underutilized.