ViLL-E: Video LLM Embeddings for Retrieval

arXiv:2604.12148v1 Announce Type: new Video Large Language Models (VideoLLMs) excel at video understanding tasks where outputs are textual, such as Video Question Answering and Video Captioning. However, they underperform specialized embedding-based models in retrieval tasks, such as Text-to-Video Retrieval and Moment Retrieval. We