GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

arXiv:2605.15764v1 Announce Type: cross Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We