Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs

ArXi:2502.16435v4 Announce Type: replace-cross Humans develop perception through a bottom-up hierarchy: from basic primitives and Gestalt principles to high-level semantics. In contrast, current Multimodal Large Language Models (MLLMs) are trained directly on complex downstream tasks, often bypassing these foundational visual capabilities. To systematically investigate this gap, we