Know-Show: Benchmarking Video-Language Models on Spatio-Temporal Grounded Reasoning

arXiv:2512.05513v3 Announce Type: replace

Large Video-Language Models (Video-LMs) have achieved impressive progress in multimodal understanding, yet their reasoning remains weakly grounded in space and time. We present Know-Show, a new benchmark designed to evaluate spatio-temporal grounded reasoning: the ability of a model to reason about actions and their semantics while simultaneously grounding its inferences in visual and temporal evidence.