Beyond Language: Grounding Referring Expressions with Hand Pointing in Egocentric Vision

arXiv:2603.26646v1 Announce Type: new Traditional Visual Grounding (VG) predominantly relies on textual descriptions to localize objects, a paradigm that inherently struggles with linguistic ambiguity and often ignores non-verbal deictic cues prevalent in real-world interactions. In natural egocentric engagements, hand-pointing combined with speech forms the most intuitive referring mechanism. To bridge this gap, we