MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

arXiv:2605.12624v1 Announce Type: cross Autonomous driving has progressed from modular pipelines toward end-to-end unification, and Vision-Language-Action (VLA) models are a natural extension of this journey beyond Vision-to-Action (VA). In practice, driving VLAs have often trailed VA on planning quality, suggesting that the difficulty is not simply model scale but the interface through which semantic reasoning, temporal context, and continuous control are combined.