From Attenuation to Attention: Variational Information Flow Manipulation for Fine-Grained Visual Perception

arXiv:2604.12508v1 Announce Type: new While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding, they frequently falter in fine-grained perception tasks that require identifying tiny objects or discerning subtle visual relationships. We attribute this limitation to Visual Attenuation: a phenomenon where sparse fine-grained visual signals are prematurely suppressed or diluted by dominant textual tokens during network propagation, resulting in a "loss of focus" during the deep-level decision-making process.