Detector-Empowered Video Large Language Model for Efficient Spatio-Temporal Grounding

arXiv:2512.06673v2 Announce Type: replace Multimodal large language models (MLLMs) are rapidly expanding from general video understanding to finer-grained tasks such as spatio-temporal video grounding (STVG) and reasoning. In these tasks, an MLLM must localize the user-queried target in both time and space and use the localization results as evidence for reasoning.