TRIO: Token Reduction via Inference-Objective Guidance for Efficient Vision-Language Models

arXiv:2602.04657v3

Reducing redundant visual tokens in vision-language models (VLMs) to accelerate inference has recently attracted considerable attention. However, most existing methods rely on heuristics based on inter-visual-token similarity or cross-modal visual-text similarity, which limits both compression performance and practical deployment.
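To make the kind of heuristic described above concrete, the following is a minimal sketch of cross-modal similarity pruning: each visual token is scored by its cosine similarity to a single text embedding, and only the top-scoring tokens are kept. This illustrates the heuristic family the abstract refers to, not the TRIO method itself. The function names, the toy 2-dimensional embeddings, and the `keep` parameter are all hypothetical and chosen purely for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_by_text_similarity(visual_tokens, text_embedding, keep):
    """Return indices of the `keep` visual tokens most similar to the text embedding.

    Hypothetical helper: a heuristic baseline, not the method proposed in the paper.
    Indices are returned in their original order so positional structure is preserved.
    """
    scores = [cosine(token, text_embedding) for token in visual_tokens]
    ranked = sorted(range(len(visual_tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy data (assumed): four 2-D "visual tokens" and one 2-D "text embedding".
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
text_embedding = [1.0, 0.0]

kept_indices = prune_by_text_similarity(visual_tokens, text_embedding, keep=2)
print(kept_indices)
```

A score of this kind depends only on embedding similarity, not on how the retained tokens affect the model's final output, which is one way to read the limitation the abstract points to.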