SVD-Prune: Training-Free Token Pruning For Efficient Vision-Language Models

arXiv:2604.11530v1 Announce Type: cross Vision-Language Models (VLMs) have revolutionized multimodal learning by jointly processing visual and textual information. Yet they face significant challenges due to the high computational and memory demands of processing long sequences of vision tokens. Many existing token-pruning methods rely on local heuristics, such as attention scores or token norms.
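
To make the "token norm" heuristic concrete, the following is a minimal sketch of one such local baseline, not of SVD-Prune itself. It assumes vision tokens are given as plain lists of floats and keeps the tokens with the largest L2 norm; the function name `prune_by_norm` and the `keep` parameter are hypothetical, chosen only for illustration.

```python
import math

def prune_by_norm(tokens, keep):
    """Keep the `keep` tokens with the largest L2 norm, preserving original order.

    Illustrative local heuristic (an assumption, not the paper's method):
    each token is scored in isolation, without regard to the other tokens.
    """
    norms = [math.sqrt(sum(x * x for x in t)) for t in tokens]
    # Rank token indices by norm, largest first; ties keep the earlier token.
    ranked = sorted(range(len(tokens)), key=lambda i: -norms[i])
    kept = sorted(ranked[:keep])
    return kept, [tokens[i] for i in kept]

# Toy example: four 2-dimensional "vision tokens".
tokens = [[0.1, 0.2], [3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
kept_idx, kept_tokens = prune_by_norm(tokens, keep=2)
print(kept_idx)  # [1, 3]
```

Because every token is scored independently, a heuristic of this kind can retain several near-duplicate tokens while discarding a low-norm token that carries distinct information, which is the sense in which such criteria are local.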