SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval

arXiv:2603.08224v1 Announce Type: new For video-text retrieval, CLIP has become the de facto choice. Since CLIP provides only image and text encoders, this consensus has led to a biased paradigm that entirely ignores the soundtrack of videos. While several attempts have been made to re