From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

arXiv:2605.02130v1 Announce Type: new Human-level agentic intelligence extends beyond low-level geometric perception, evolving from recognizing where things are to understanding what they are for. While existing benchmarks effectively evaluate the geometric perception capabilities of multimodal large language models (MLLMs), they fall short of probing the higher-order cognitive abilities required for grounded intelligence. To address this gap, we