ShaRP: SHAllow-LayeR Pruning for Efficient Video Large Language Models

arXiv cs.CV

arXiv:2512.05385v2

Video Large Language Models (VLLMs) incur substantial prefilling cost due to the large number of visual tokens. While attention-based token pruning offers a promising acceleration strategy, applying it at shallow decoder layers often causes severe performance degradation under high compression ratios, limiting its practical benefits.
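
To make the idea of attention-based token pruning concrete, here is a minimal sketch of the generic technique, not the ShaRP method itself. It assumes attention weights from query tokens to visual tokens are already averaged over heads; the function name `prune_visual_tokens` and the `keep_ratio` parameter are hypothetical and chosen only for illustration.

```python
from typing import List


def prune_visual_tokens(attn: List[List[float]], keep_ratio: float) -> List[int]:
    """Return sorted indices of the visual tokens to keep.

    attn[q][v] is the attention weight from query token q to visual token v
    (assumed to be head-averaged). keep_ratio is a hypothetical parameter:
    the fraction of visual tokens retained after pruning.
    """
    if not attn or not attn[0]:
        return []
    num_visual = len(attn[0])
    # Score each visual token by the mean attention it receives.
    scores = [sum(row[v] for row in attn) / len(attn) for v in range(num_visual)]
    # Always keep at least one token.
    num_keep = max(1, int(num_visual * keep_ratio))
    # Select the highest-scoring tokens, then restore positional order.
    top = sorted(range(num_visual), key=lambda v: scores[v], reverse=True)[:num_keep]
    return sorted(top)


# Toy example: 2 query tokens attending to 4 visual tokens.
attn = [
    [0.10, 0.40, 0.05, 0.45],
    [0.20, 0.30, 0.10, 0.40],
]
kept = prune_visual_tokens(attn, keep_ratio=0.5)
print(kept)  # [1, 3]
```

A lower `keep_ratio` corresponds to a higher compression ratio. The abstract's point is that this kind of selection, when applied at shallow decoder layers, tends to degrade accuracy sharply once the ratio becomes aggressive.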