EVA: Efficient Reinforcement Learning for End-to-End Video Agent

arXiv:2603.22918v1 Announce Type: cross Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames. Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods