[R] Evaluating MLLMs with Child-Inspired Cognitive Tasks

Hey there, we’re sharing KidGym, an interactive 2D grid-based benchmark for evaluating MLLMs in continuous, trajectory-based interaction, accepted to ICLR 2026. Motivation: Many existing MLLM benchmarks are static and focus on isolated skills, which makes them less faithful for characterizing model capabilities in continuous interactive settings. Inspired by the Wechsler Intelligence Scale for Children (WISC), we organize evaluation into five cognitive dimensions and design tasks to probe both single abilities and compositional abilities.