Understanding the Performance Plateau in Text-to-Video Retrieval: A Comprehensive Empirical and Linguistic Analysis

ArXi:2605.00826v1 Announce Type: cross Text-to-video retrieval enables users to find relevant video content using natural language queries, a task that has grown increasingly important with the rapid expansion of online video. Over the past six years, research has produced numerous methods, such as dual encoders, attention-driven models, and multimodal fusion approaches; however, fundamental questions remain about model behavior, dataset influence, and query difficulty.