ArrowGEV: Grounding Events in Video via Learning the Arrow of Time

arXiv:2601.06559v2 Announce Type: replace Grounding events in videos is a fundamental capability in video analysis. While Vision Language Models (VLMs) are increasingly employed for this task, existing approaches predominantly train models to associate events with timestamps in the forward video only. This paradigm hinders VLMs from capturing the inherent temporal structure and directionality of events, thereby limiting robustness and generalization.