VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration

arXiv:2601.22674v3 Announce Type: replace Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for