VidNum-1.4K: A Comprehensive Benchmark for Video-based Numerical Reasoning

arXiv:2604.03701v1 Announce Type: new Video-based numerical reasoning offers a demanding test of whether Vision-Language Models (VLMs) truly "understand" real-world dynamics, because accurate numerical deduction requires a firm grasp of temporal events, object permanence, and compositional logic rather than superficial pattern matching.