HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models

arXiv:2603.18850v1 Announce Type: new Video question answering (VQA) with vision-language models (VLMs) depends critically on which frames are selected from the input video, yet most systems rely on uniform or heuristic sampling that cannot be optimized for downstream answering quality. We