Learning Question-Aware Keyframe Selection with Synthetic Supervision for Video Question Answering

arXiv:2603.14953v1 Announce Type: cross Large multimodal models (LMMs) have recently demonstrated remarkable performance in video question answering (VideoQA), yet reasoning over video remains challenging due to high inference cost and diluted information. Keyframe selection offers efficiency and sharper reasoning but suffers from sparse supervision and redundant frame choices when relying only on image-text similarity.
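
To make the redundancy problem concrete, here is a minimal sketch of the similarity-only baseline the abstract criticizes, not the paper's method. It assumes each frame and the question have already been embedded into a shared space by some image-text encoder; the function name `select_topk_by_similarity` and the toy 2-D vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_topk_by_similarity(frame_embs, question_emb, k):
    """Hypothetical baseline: keep the k frames most similar to the question.

    Returns the chosen frame indices in temporal order.
    """
    scores = [cosine(f, question_emb) for f in frame_embs]
    ranked = sorted(range(len(frame_embs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy embeddings (assumed, not real encoder outputs).
frames = [
    [1.0, 0.0],    # frame 0: matches the question
    [0.99, 0.05],  # frame 1: near-duplicate of frame 0
    [0.98, 0.10],  # frame 2: near-duplicate of frame 0
    [0.6, 0.8],    # frame 3: different content, partly relevant
    [0.0, 1.0],    # frame 4: unrelated
]
question = [1.0, 0.0]

selected = select_topk_by_similarity(frames, question, k=3)
print(selected)  # [0, 1, 2]
```

All three selected frames are near-duplicates, so the frame budget is spent on one moment of the video and frame 3, which carries different content, is never seen. This is the kind of redundant choice that scoring each frame independently by image-text similarity tends to produce.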