Rethinking Token Reduction for Large Vision-Language Models

arXiv:2603.21701v1 Announce Type: cross Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but their excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn Visual Question Answering (VQA), leaving the practical multi-turn VQA (MT-VQA) scenario largely unexplored. MT