STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding

arXiv:2603.27593v1 Announce Type: cross Recent progress in video large language models (Video-LLMs) has enabled strong offline reasoning over long and complex videos. However, real-world deployments increasingly require streaming perception and proactive interaction, where video frames arrive online and the system must decide not only what to respond, but also when to respond.