Vision-Language Models Mistake Head Orientation for Gaze Direction: Nonverbal Conversation Cues

arXiv:2506.05412v3 Announce Type: replace-cross Where someone looks is a nonverbal communication cue that children and adults readily use. How well can Vision-Language Models (VLMs) infer gaze targets? To construct evaluation stimuli, we captured 1,360 real-world photos of scenes in which a person gazes at one of several objects on a table. Importantly, we also controlled the gazer's head orientation: sometimes it was directed toward the gaze target, sometimes toward a distractor object, and sometimes left unconstrained.