SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving

arXiv:2602.11656v2 Announce Type: replace In autonomous driving, end-to-end (E2E) driving systems that predict control commands directly from sensor data have advanced significantly. For safe driving in unexpected scenarios, these systems may additionally rely on human interventions such as natural language instructions. Using a multi-modal large language model (MLLM) facilitates human-vehicle interaction and can improve performance in such scenarios.