Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images

arXiv:2604.11025v1 Announce Type: new Recent multimodal large language models (MLLMs) have begun to "think with images" by invoking visual tools such as zooming and cropping during inference. Yet these systems remain brittle in fine-grained visual reasoning because they must decide where to look before they have access to the evidence needed to make that decision correctly. We identify this circular dependency as the Grounding Paradox. To address it, we propose Test-Time Scaling over Perception (TTSP), a framework that treats perception itself as a scalable inference process.