Hidden Meanings in Plain Sight: RebusBench for Evaluating Cognitive Visual Reasoning

arXiv:2604.01764v1 Announce Type: new Large Vision-Language Models (LVLMs) have achieved remarkable proficiency in explicit visual recognition, effectively describing what is directly visible in an image. However, a critical cognitive gap emerges when the visual input serves only as a clue rather than the answer. We find that current models struggle with the complex, multi-step reasoning required to solve problems where the relevant information is not explicitly depicted.