Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models

arXiv:2512.10362v2 Announce Type: replace Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning capabilities, but often fail to perceive fine-grained visual details, limiting their applicability in precision-demanding tasks. While methods that crop salient regions of an image offer a partial solution, we identify a critical limitation they