MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

arXiv:2605.14906v1 Announce Type: new Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions. Two lines of work provide this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark systematically compares the two on questions that genuinely require multimodal evidence. To close this gap, we