VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding

arXiv:2507.13353v2 Announce Type: replace-cross While Video Large Language Models (Video-LLMs) have shown significant potential in multimodal understanding and reasoning tasks, efficiently selecting the most informative frames from a video remains a critical challenge. Existing methods attempt to optimize frame sampling by reducing inter-frame redundancy or by employing unsupervised event localization.