EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval

arXiv:2603.25267v1 Announce Type: new Text-video retrieval has improved significantly with the recent development of large-scale vision-language pre-trained models. Traditional methods primarily focus on video representations or cross-modal alignment, while recent work shifts toward enriching text expressiveness to better match the rich semantics in videos.