AI RESEARCH
DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference
arXiv CS.AI
arXiv:2602.18846v2 Announce Type: replace-cross Vision-language models (VLMs) have achieved remarkable multimodal understanding and reasoning capabilities, yet they remain computationally expensive due to dense visual tokenization. Existing efficiency approaches either merge redundant visual tokens or drop them progressively in the language backbone, often trading accuracy for speed.