Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding

arXiv:2603.15167v1 Announce Type: new Many frameworks have been proposed for long-term video understanding with large multimodal models. Although transformer-based visual compressors and memory-augmented approaches are often used to process long videos, they usually compress each frame independently and therefore fail to achieve strong performance on tasks that require understanding complete events, such as temporal ordering tasks in MLVU and VNBench.