Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding

arXiv:2604.08014v1 Announce Type: new Spatio-Temporal Video Grounding requires jointly localizing target objects across both temporal and spatial dimensions based on natural language queries, posing fundamental challenges for existing Multimodal Large Language Models (MLLMs)