AI RESEARCH

LFS: Learnable Frame Selector for Event-Aware and Temporally Diverse Video Captioning

arXiv CS.CV

arXiv:2601.14594v2 Announce Type: replace Video captioning models convert frames into visual tokens and generate descriptions with large language models (LLMs). Since encoding all frames is prohibitively expensive, uniform sampling is the default choice, but it enforces equal temporal coverage while ignoring the uneven distribution of events. This motivates a Learnable Frame Selector (LFS) that selects temporally diverse and event-relevant frames.
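
To make the uniform-sampling baseline concrete, here is a minimal Python sketch of one common way to pick evenly spaced frame indices (the centre of each equal-length segment). The function name `uniform_frame_indices`, the 100-frame video, and the event window at frames 40 to 45 are all hypothetical and chosen only for illustration; none of this is taken from the paper, and it does not implement LFS itself.

```python
from typing import List


def uniform_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick evenly spaced frame indices (hypothetical helper, not from the paper).

    The video is split into `num_samples` equal-length segments and the
    centre frame of each segment is returned.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    # Never ask for more frames than the video contains.
    num_samples = min(num_samples, num_frames)
    segment = num_frames / num_samples
    return [int(segment * (i + 0.5)) for i in range(num_samples)]


# Assumed toy setting: a 100-frame video with a short event in frames 40..45.
indices = uniform_frame_indices(100, 4)
event_hits = [i for i in indices if 40 <= i <= 45]
print(indices)      # [12, 37, 62, 87]
print(event_hits)   # [] : the short event falls between sampled frames
```

In this toy case the four sampled frames cover the timeline evenly yet miss the short event entirely, which is the limitation of uniform sampling that the abstract describes.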