Video Understanding: From Geometry and Semantics to Unified Models

arXiv:2603.17840v1

Video understanding aims to enable models to perceive, reason about, and interact with the dynamic visual world. In contrast to image understanding, video understanding inherently requires modeling temporal dynamics and evolving visual context, placing stronger demands on spatiotemporal reasoning and making it a foundational problem in computer vision.