DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference

arXiv:2602.18846v2 Announce Type: replace-cross Vision-language models (VLMs) have achieved remarkable multimodal understanding and reasoning capabilities, yet remain computationally expensive due to dense visual tokenization. Existing efficiency approaches either merge redundant visual tokens or drop them progressively in the language backbone, often trading accuracy for speed.