Structure Over Scale: Learning Visual Reasoning from Pedagogical Video

arXiv:2601.23251v2 Announce Type: replace

State-of-the-art vision-language models (VLMs) score impressively on video benchmarks yet stumble on basic visual reasoning tasks involving spatial relations, navigation, and object selection that a preschooler solves easily.