HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

arXiv:2601.14724v3 Announce Type: replace-cross Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel