Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection

arXiv:2604.02071v1 Announce Type: cross Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to