Multi-Scale Contrastive Learning for Video Temporal Grounding

arXiv:2412.07157v3 Announce Type: replace

Temporal grounding, which localizes the video moments related to a natural language query, is a core problem in vision-language learning and video understanding. To encode video moments of varying lengths, recent methods employ a multi-level structure known as a feature pyramid. In this structure, lower levels concentrate on short-range video moments, while higher levels address long-range moments.
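
To make the feature pyramid concrete, here is a minimal sketch in plain Python. The abstract does not say how the levels are built, so the stride-2 average pooling used here is an assumption for illustration, and the names `downsample` and `build_pyramid` are hypothetical rather than taken from the paper. Each level halves the temporal length, so one position at level k summarizes roughly 2^k consecutive clips.

```python
from typing import List

Vector = List[float]


def downsample(level: List[Vector]) -> List[Vector]:
    """Halve the temporal length by averaging adjacent pairs of feature vectors.

    Assumption: stride-2 average pooling; the source does not specify the operator.
    """
    pooled = []
    for i in range(0, len(level) - 1, 2):
        a, b = level[i], level[i + 1]
        pooled.append([(x + y) / 2.0 for x, y in zip(a, b)])
    if len(level) % 2 == 1:
        # An unpaired trailing clip is carried over unchanged.
        pooled.append(list(level[-1]))
    return pooled


def build_pyramid(clip_features: List[Vector], num_levels: int) -> List[List[Vector]]:
    """Return a list of levels; level 0 is the finest (short-range) resolution."""
    if num_levels < 1:
        raise ValueError("num_levels must be at least 1")
    pyramid = [[list(v) for v in clip_features]]
    for _ in range(num_levels - 1):
        if len(pyramid[-1]) <= 1:
            break  # cannot downsample a single position any further
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


# Toy input: 8 clips, each with a 2-dimensional feature [clip index, 1.0].
features = [[float(t), 1.0] for t in range(8)]
pyramid = build_pyramid(features, num_levels=4)
for k, level in enumerate(pyramid):
    print(f"level {k}: {len(level)} positions, each covering ~{2 ** k} clips")
```

In this toy setup, the lowest level keeps one feature per clip and is suited to short moments, while the top level holds a single feature that summarizes all eight clips and is suited to long moments.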