PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

arXiv:2604.08991v1 Announce Type: cross Small object-centric spatial understanding in indoor videos remains a significant challenge for multimodal large language models (MLLMs), despite its practical value for object search and assistive applications. Although existing benchmarks have advanced video spatial intelligence, embodied reasoning, and diagnostic perception, none directly evaluates whether a model can localize a target object in video and express its position with sufficient precision for downstream use. In this work, we.