TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings

arXiv:2603.06687v1 Announce Type: cross Geo-temporal understanding, the ability to infer location, time, and contextual properties from visual input alone, underpins applications such as disaster management, traffic planning, embodied navigation, world modeling, and geography education. Although recent vision-language models (VLMs) have advanced image geo-localization using cues like landmarks and road signs, their ability to reason about temporal signals and physically grounded spatial cues remains limited. To address this gap, we.