Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search

arXiv:2601.13719v2 Announce Type: replace-cross Long video understanding presents significant challenges for vision-language models because of the extremely long context windows it requires. Existing solutions that rely on naive chunking strategies with retrieval-augmented generation typically suffer from information fragmentation and a loss of global coherence. We present HAVEN, a unified framework for long-video understanding that enables coherent and comprehensive reasoning by integrating audiovisual entity cohesion and hierarchical video indexing with agentic search.