Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models

arXiv:2605.09429v1 Announce Type: cross

Are low-attention visual tokens truly redundant in vision-language reasoning? Existing pruning methods often assume so, ranking visual tokens by shallow text-to-image attention and discarding low-scoring patches to accelerate LVLM inference. We show that this scalar criterion is unreliable for compositional reasoning: tokens ignored in early layers can later become essential for resolving secondary objects, spatial relations, and contextual cues.
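
For concreteness, the scalar criterion that the abstract critiques can be sketched as follows. This is a minimal illustration of attention-based top-k pruning in general, not the method proposed in the paper and not any specific library's API. The function name `prune_by_text_attention`, the `keep_ratio` parameter, and the choice to average over heads and text queries are all assumptions made for this example.

```python
import numpy as np

def prune_by_text_attention(attn, visual_idx, text_idx, keep_ratio=0.5):
    """Rank visual tokens by text-to-image attention and keep the top fraction.

    attn: array of shape (heads, seq, seq) holding attention weights from a
          single shallow layer (rows are queries, columns are keys).
    visual_idx, text_idx: positions of visual and text tokens in the sequence.
    Hypothetical helper for illustration only.
    """
    visual_idx = np.asarray(visual_idx)
    text_idx = np.asarray(text_idx)
    # Attention from text queries to visual keys: (heads, n_text, n_visual).
    text_to_visual = attn[:, text_idx][:, :, visual_idx]
    # Collapse to one scalar score per visual token (assumed: plain mean).
    scores = text_to_visual.mean(axis=(0, 1))
    n_keep = max(1, int(round(keep_ratio * len(visual_idx))))
    # Highest-scoring tokens survive; everything else is discarded.
    top = np.argsort(-scores, kind="stable")[:n_keep]
    kept = np.sort(visual_idx[top])  # restore original token order
    return kept, scores

# Toy example: 4 visual tokens (positions 0-3), 2 text tokens (positions 4-5).
attn = np.zeros((2, 6, 6))
attn[:, 4, :4] = [0.1, 0.4, 0.2, 0.3]
attn[:, 5, :4] = [0.1, 0.4, 0.2, 0.3]
kept, scores = prune_by_text_attention(attn, [0, 1, 2, 3], [4, 5], keep_ratio=0.5)
print(kept)
```

Because the decision is made once from a single layer's scores, a token that scores low here is dropped for all later layers, which is the failure mode the abstract describes for secondary objects and spatial relations.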