A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding

arXiv:2603.14733v1 Announce Type: new Multimodal Large Language Models have achieved strong performance in single-video understanding, yet their ability to reason across multiple videos remains limited. Existing approaches typically concatenate multiple videos into a single input and perform direct inference, which