Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models

arXiv:2605.01662v1 Announce Type: new Large vision-language models (VLMs) have advanced multimodal tasks such as video question answering (QA). However, VLMs face the challenge of selecting frames effectively and efficiently, as standard uniform sampling is expensive and performance may plateau. Inspired by active perception theory, which posits that models gain information by acquiring data that differs from their expectations, we