Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism

arXiv:2603.29252v1 Announce Type: cross Long video understanding is a key challenge that hinders the advancement of \emph{Multimodal Large Language Models} (MLLMs). In this paper, we study this problem from the perspective of a visual memory mechanism, and propose a novel and