A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning

arXiv:2603.14052v1 Announce Type: new This paper presents A4VL, a multi-agent perception-action exploration alliance for efficient long-video reasoning. A4VL operates in a multi-round perception-action exploration loop with a selection of VLM agents. In each round, the team of agents performs video question answering (VideoQA) via perception exploration followed by action exploration.
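
The round structure described above can be illustrated with a minimal control-flow sketch. It is not the paper's implementation: the `Agent` interface, the `run_rounds` function, the representation of frames as integer indices, the unanimous-agreement stopping rule, and the pooling of frames between rounds are all hypothetical placeholders introduced here, since the abstract does not specify them. The sketch shows only the ordering it does state: within each round, every agent first performs perception exploration and then action exploration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in: frames are represented as integer indices.
Frames = List[int]


@dataclass
class Agent:
    """Hypothetical VLM agent interface (not taken from the paper)."""
    name: str
    # Perception exploration: choose which frames to look at for the question.
    perceive: Callable[[str, Frames], Frames]
    # Action exploration: propose an answer, or None if undecided.
    act: Callable[[str, Frames], Optional[str]]


def run_rounds(
    agents: List[Agent],
    question: str,
    frames: Frames,
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Run up to max_rounds of perception followed by action.

    Returns (answer, rounds_used). The answer is None if no agreement
    was reached within max_rounds.
    """
    for round_idx in range(1, max_rounds + 1):
        # Step 1: perception exploration by every agent.
        observations = [agent.perceive(question, frames) for agent in agents]
        # Step 2: action exploration on each agent's own observations.
        answers = [
            agent.act(question, obs) for agent, obs in zip(agents, observations)
        ]
        # Assumed stopping rule: all agents answered and they all agree.
        if None not in answers and len(set(answers)) == 1:
            return answers[0], round_idx
        # Assumed hand-off: the next round starts from the pooled frames.
        pooled = sorted({f for obs in observations for f in obs})
        frames = pooled or frames
    return None, max_rounds
```

With two toy agents that both answer "yes" in the first round, `run_rounds` returns after one round. If the agents never agree, it returns `None` once `max_rounds` is exhausted.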