StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos

arXiv:2512.01707v2 Announce Type: replace-cross

Streaming video understanding requires models not only to process frames as they arrive, but also to anticipate user intention in realistic applications such as Augmented Reality (AR) glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether Multimodal Large Language Models (MLLMs) can interpret or leverage human gaze signals in a streaming setting. To fill this gap, we