AI RESEARCH

How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms

arXiv CS.CV

arXiv:2604.08966v1 Announce Type: new While Multimodal Large Language Models (MLLMs) have advanced Video Temporal Grounding (VTG), existing methods often couple output paradigms with different backbones, datasets, and