WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

arXiv:2512.02425v2 Announce Type: replace-cross Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they rely heavily on text and fail to utilize visual evidence when reasoning over complex scenes.