VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding

arXiv:2508.06869v4 Announce Type: replace-cross Multimodal large language models (MLLMs) demonstrate exceptional performance in vision-language tasks, yet their processing of long videos is constrained by input context length and high computational costs. Sparse frame sampling thus becomes a necessary preprocessing step, with sampled frame quality directly impacting downstream performance. Existing keyframe search algorithms balance efficiency against sampled frame quality but rely heavily on the visual modality alone.