AI RESEARCH
Know-Show: Benchmarking Video-Language Models on Spatio-Temporal Grounded Reasoning
arXiv CS.CV
arXiv:2512.05513v3 Announce Type: replace
Large Video-Language Models (Video-LMs) have made impressive progress in multimodal understanding, yet their reasoning remains weakly grounded in space and time. We present Know-Show, a new benchmark designed to evaluate spatio-temporal grounded reasoning: the ability of a model to reason about actions and their semantics while simultaneously grounding its inferences in visual and temporal evidence.