GazeVLM: A Vision-Language Model for Multi-Task Gaze Understanding

ArXi:2511.06348v2 Announce Type: replace-cross Gaze understanding unifies the detection of people, their gaze targets, and objects of interest into a single framework, offering critical insight into visual attention and intent estimation. Although prior research has modelled gaze cues in visual scenes, a unified system is still needed for gaze understanding using both visual and language prompts. This paper