LFS: Learnable Frame Selector for Event-Aware and Temporally Diverse Video Captioning

arXiv:2601.14594v2 Announce Type: replace Video captioning models convert frames into visual tokens and generate descriptions with large language models (LLMs). Because encoding every frame is prohibitively expensive, uniform sampling is the default choice, but it enforces equal temporal coverage and ignores the uneven distribution of events across a video. This motivates a Learnable Frame Selector (LFS) that selects temporally diverse and event-relevant frames.
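
To make the limitation of uniform sampling concrete, the following is a minimal sketch of the baseline only, not of LFS itself. The function name `uniform_sample_indices`, the 300-frame clip, and the event span are hypothetical values chosen for illustration; none of them come from the paper.

```python
def uniform_sample_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick frame indices at the centers of equal-width temporal segments."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    # Never request more frames than the clip contains.
    num_samples = min(num_samples, num_frames)
    step = num_frames / num_samples
    return [int(step * (i + 0.5)) for i in range(num_samples)]


# Hypothetical clip: 300 frames, with one short event at frames 100..109.
event_frames = set(range(100, 110))

picked = uniform_sample_indices(num_frames=300, num_samples=8)
hits = [i for i in picked if i in event_frames]

print("sampled indices:", picked)
print("samples inside the event:", hits)
```

With these assumed numbers, the eight samples are spaced 37.5 frames apart, so a 10-frame event can fall entirely between two consecutive samples. This is the kind of gap that an event-aware selector is meant to close.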