Can Vision-Language Models Solve the Shell Game?

arXiv:2603.08436v1. Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We