Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents

arXiv:2509.24943v2 Announce Type: replace Long videos, characterized by temporal complexity and sparse task-relevant information, pose significant reasoning challenges for AI systems. Although existing Large Language Model (LLM)-based approaches have advanced long video understanding, they remain bottlenecked by task-agnostic, fixed-granularity perception pipelines and suffer from vision-language hallucinations.