Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding

arXiv:2603.00512v2 Announce Type: replace

Frame selection is crucial when applying Large Vision-Language Models (LVLMs) to long videos, because of high frame redundancy and limited context windows. Current methods typically select frames with high relevance to a given query, resulting in a disjointed set of frames that disregards the narrative structure of the video. In this paper, we