Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

arXiv:2605.13080v1

When humans describe a visual scene, they do not process the entire image uniformly; instead, they selectively fixate on regions relevant to their intended description. In contrast, current multimodal large language models (MLLMs) attend to all visual tokens at each generation step, leading to diluted focus and unnecessary computational overhead. In this work, we