Information Router for Mitigating Modality Dominance in Vision-Language Models

arXiv:2604.16264v1 Announce Type: cross Vision-language models (VLMs) have demonstrated strong performance across a wide range of benchmarks, yet they often suffer from modality dominance, where predictions rely disproportionately on a single modality. Prior approaches primarily address this issue by steering the model's attention allocation, implicitly assuming that all modalities provide sufficient information. However, attention only determines where the model focuses; it cannot enrich information that is missing or ambiguous.