StreamingTOM: Streaming Token Compression for Efficient Video Understanding

arXiv:2510.18269v2 Announce Type: replace-cross Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to the future frames that offline methods exploit, while accumulation causes the token count to grow without bound, creating efficiency bottlenecks. However, existing approaches only regulate the post-LLM kv-cache, leaving the costly pre-LLM prefill unchanged. We