SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

arXiv:2604.05079v1 Announce Type: new Video question answering (VideoQA) is a challenging task that requires integrating spatial, temporal, and semantic information to capture the complex dynamics of video sequences. Although recent advances have