ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

arXiv:2512.03666v2 Announce Type: replace-cross

A core capability for general embodied intelligence is localizing task-relevant objects from an egocentric perspective, a problem formulated as Spatio-Temporal Video Grounding (STVG). Despite recent progress, existing STVG studies remain largely confined to object-centric, descriptive instructions and neglect the task-oriented reasoning that embodied agents need to accomplish goal-directed interactions. To bridge this gap, we