Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs

ArXi:2505.19155v2 Announce Type: replace-cross Due to the auto-regressive nature of current video large language models (Video-LLMs), the inference latency increases as the input sequence length grows, posing challenges for the efficient processing of video sequences that are usually very long. We observe that during decoding, the attention scores of most tokens in Video-LLMs tend to be sparse and concentrated, with only certain tokens requiring comprehensive full attention. Based on this insight, we.