VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

arXiv:2605.16079v1 Announce Type: cross Large Vision-Language Models (LVLMs) have made significant progress in video understanding, yet they still struggle with tasks that require precise spatiotemporal localization at the instance level. Existing methods rely primarily on text prompts for human-model interaction, but such prompts cannot easily convey precise spatial and temporal references, resulting in a poor user experience.