Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos

arXiv:2603.17693v1 Announce Type: new The transition from image to video understanding requires vision-language models (VLMs) to shift from recognizing static patterns to reasoning over temporal dynamics such as motion trajectories, speed changes, and state transitions. Yet current post-