CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding

arXiv:2605.14310v1

Streaming video understanding with large vision-language models (VLMs) requires a compact memory that can support future reasoning over an ever-growing visual history. A common solution is to compress the key-value (KV) cache, but existing streaming methods typically rely on local token-wise heuristics, such as recency, temporal redundancy, or saliency, which do not explicitly optimize for how well the retained cache represents the accumulated history.
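To make the contrast concrete, the following toy sketch compares a recency heuristic with a generic coverage-driven selection under a fixed cache budget. It is an illustration only and is not the CoRDS algorithm: the greedy farthest-point (k-center) rule is a classic coreset heuristic swapped in for demonstration, the 2-D points stand in for key vectors, and the names `recency_select`, `k_center_select`, and `coverage_radius` are hypothetical.

```python
import math


def recency_select(keys, budget):
    """Local heuristic: keep the indices of the most recent `budget` tokens."""
    n = len(keys)
    return list(range(max(0, n - budget), n))


def k_center_select(keys, budget):
    """Greedy farthest-point selection, a classic k-center coreset heuristic.

    Assumption: this is a generic stand-in, not the selection rule of the paper.
    """
    if not keys or budget <= 0:
        return []
    selected = [0]
    # dist[i] = distance from token i to its nearest selected token so far
    dist = [math.dist(k, keys[0]) for k in keys]
    while len(selected) < min(budget, len(keys)):
        nxt = max(range(len(keys)), key=dist.__getitem__)
        selected.append(nxt)
        dist = [min(d, math.dist(k, keys[nxt])) for d, k in zip(dist, keys)]
    return sorted(selected)


def coverage_radius(keys, selected):
    """Largest distance from any history token to its nearest retained token."""
    return max(min(math.dist(k, keys[i]) for i in selected) for k in keys)


# Toy history: three "scenes", each a tight cluster of three key vectors.
keys = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0), (10.1, 0.0), (10.0, 0.1),
]
budget = 3

recent = recency_select(keys, budget)
coreset = k_center_select(keys, budget)

print("recency keeps:", recent, "radius:", coverage_radius(keys, recent))
print("k-center keeps:", coreset, "radius:", coverage_radius(keys, coreset))
```

In this toy setup the recency rule retains only the last scene, so the earlier scenes are far from every retained key, while the coverage-driven rule keeps one key per scene. A smaller coverage radius is one simple proxy for "representative of the accumulated history"; the paper's actual objective may differ.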