Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering

arXiv:2603.12533v1 Announce Type: new

Understanding and answering questions based on a user's pointing gesture is essential for next-generation egocentric AI assistants. However, current Multimodal Large Language Models (MLLMs) struggle with such tasks due to the lack of gesture-rich data and their limited ability to infer fine-grained pointing intent from egocentric video. To address this, we