AI RESEARCH

Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference

arXiv CS.AI

arXiv:2605.19218v1 Announce Type: cross Vision-Language Models suffer severe KV cache pressure at inference, because a single image is often encoded into thousands of tokens. Most existing methods exploit token sparsity through token pruning, but permanently discarding visual content causes substantial degradation on fine-grained perception tasks. This motivates a complementary axis, feature sparsity: under a fixed KV cache budget, compressing the channel dimension instead of the token dimension preserves visual tokens at the same memory cost.
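
The fixed-budget trade-off can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the function name `kv_cache_elements`, the sequence length (2048 tokens), the head dimension (128), and the pruning ratios are all assumptions chosen so that the two strategies land on the same cache size. It also assumes that only key channels are compressed, as the title suggests, while values keep their full width.

```python
def kv_cache_elements(num_tokens: int, key_dim: int, value_dim: int) -> int:
    """Cache size in elements for one attention head in one layer.

    Each cached token stores one key vector and one value vector.
    """
    return num_tokens * (key_dim + value_dim)


# Hypothetical baseline: 2048 visual tokens, 128-dim keys and values.
full = kv_cache_elements(num_tokens=2048, key_dim=128, value_dim=128)

# Token sparsity: drop 25% of the tokens, keep every channel.
token_pruned = kv_cache_elements(num_tokens=1536, key_dim=128, value_dim=128)

# Feature sparsity: keep every token, drop half of the key channels.
channel_pruned = kv_cache_elements(num_tokens=2048, key_dim=64, value_dim=128)

print(full, token_pruned, channel_pruned)
```

Under these assumed numbers both pruned configurations use 75% of the baseline cache, but only the channel-pruned one still holds an entry for every visual token.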