T2SGrid: Temporal-to-Spatial Gridification for Video Temporal Grounding

arXiv:2603.06973v1 Announce Type: new Video Temporal Grounding (VTG) aims to localize the video segment that corresponds to a natural language query, which requires a comprehensive understanding of complex temporal dynamics. Existing Vision-LMMs typically perceive temporal dynamics via positional encoding, text-based timestamps, or visual frame numbering. However, these approaches exhibit notable limitations: assigning each frame a text-based timestamp token