FLARE: Learning Future-Aware Latent Representations from Vision-Language Models for Autonomous Driving

arXiv:2601.05611v2

While Vision-Language Models (VLMs) offer rich world knowledge for end-to-end autonomous driving, current approaches rely heavily on labor-intensive language annotations (e.g., VQA) to bridge perception and control. This paradigm suffers from a fundamental mismatch between discrete linguistic tokens and continuous driving trajectories, which often leads to suboptimal control policies and inefficient use of pre-trained knowledge.