iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding

arXiv:2603.02748v2 Announce Type: replace Despite the success of Large Vision-Language Models (LVLMs), most existing architectures suffer from a representation bottleneck: they rely on static, instruction-agnostic vision encoders, so the same visual representation is used regardless of the textual task. This rigidity hinders fine-grained reasoning, where task-specific visual cues are critical. To address this issue, we propose iGVLM, a general framework for instruction-guided visual modulation.