Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models

arXiv:2603.01400v2 Announce Type: replace Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning methods primarily target intra-frame spatial redundancy or prune inside the LLM with shallow-layer overhead, yielding suboptimal spatiotemporal reduction and underutilizing long-context compressibility. These methods also often discard subtle yet informative context from merged or pruned tokens.