Time Blindness: Why Video-Language Models Can't See What Humans Can?

ArXi:2505.24867v2 Announce Type: replace-cross Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We