MASRA: MLLM-Assisted Semantic-Relational Consistent Alignment for Video Temporal Grounding

arXiv:2605.03398v1

Video Temporal Grounding (VTG) faces a cross-modal semantic gap that often causes background features to be incorrectly aligned with the query, while directly matching the query to moments yields insufficient discriminability and consistency of temporal semantics. To address this issue, we propose MLLM-Assisted Semantic-Relational Consistent Alignment (MASRA), a