Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding

arXiv:2512.04000v2 Announce Type: replace-cross The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video tokens. Consequently, recent research has focused on query-aware frame selection, but these methods often incur significant computational overhead. This paper challenges the assumption that such complex search mechanisms are universally necessary. We first identify and validate a query typology that distinguishes between global queries and localized queries.