A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos

arXiv:2604.02860v1 Announce Type: cross Temporal sentence grounding in videos (TSGV) aims to localize the temporal segment of an untrimmed video that semantically corresponds to a sentence query. Most current methods adopt pre-trained, query-agnostic visual encoders for offline feature extraction, leaving the video backbones frozen and not optimized for TSGV. This leads to a task discrepancy: the video backbone is trained for visual classification but utilized for