VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

arXiv:2604.01569v1 Announce Type: new Recent video multimodal large language models achieve impressive results across various benchmarks. However, current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is often measured without verifying whether models identify the precise spatio-temporal evidence supporting their predictions.