VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning

arXiv:2601.15724v2 Announce Type: replace Long-form video understanding remains a fundamental challenge for current Video Large Language Models. Most existing models rely on static reasoning over uniformly sampled frames, which weakens temporal localization and leads to substantial information loss in long videos. Agentic tools such as temporal retrieval, spatial zoom, and temporal zoom offer a natural way to overcome these limitations by enabling adaptive exploration of key moments.
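To make the information-loss argument concrete, the following is a minimal sketch of how sparse uniform sampling becomes on a long video, and how resampling inside a shorter window (one plausible reading of "temporal zoom") tightens the spacing. The function names, the 32-frame budget, the 30 fps rate, and the one-hour duration are illustrative assumptions and are not taken from the paper.

```python
# Illustrative sketch only: names and numbers are assumptions, not the paper's method.

def uniform_sample_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick frame indices spread evenly over the whole clip (centre of each bin)."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def temporal_zoom_indices(start_frame: int, end_frame: int, num_samples: int) -> list[int]:
    """Spend the same frame budget inside the window [start_frame, end_frame)."""
    span = end_frame - start_frame
    return [start_frame + i for i in uniform_sample_indices(span, num_samples)]


def seconds_between_samples(indices: list[int], fps: float) -> float:
    """Gap in seconds between the first two sampled frames."""
    if len(indices) < 2:
        return float("inf")
    return (indices[1] - indices[0]) / fps


FPS = 30.0                      # assumed frame rate
TOTAL_FRAMES = 3600 * 30        # assumed one-hour video
BUDGET = 32                     # assumed number of frames the model can take

# Static strategy: one pass over the whole video.
whole_video = uniform_sample_indices(TOTAL_FRAMES, BUDGET)
print(seconds_between_samples(whole_video, FPS))   # 112.5 seconds between frames

# Adaptive strategy: zoom into a hypothetical 60-second window of interest.
window = temporal_zoom_indices(54_000, 55_800, BUDGET)
print(seconds_between_samples(window, FPS))        # under 2 seconds between frames
```

Under these assumptions, a fixed budget of 32 frames leaves almost two minutes of video unseen between consecutive samples, while the same budget spent on a one-minute window samples roughly every two seconds. Any event shorter than the gap can be missed entirely by the static strategy, which is the limitation the agentic tools are meant to address.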