Boosting Visual Instruction Tuning with Self-Supervised Guidance

arXiv:2604.12966v1 Announce Type: new Multimodal large language models (MLLMs) perform well on many vision-language tasks but often struggle with vision-centric problems that require fine-grained visual reasoning. Recent evidence suggests that this limitation arises not from weak visual representations, but from under-utilization of visual information during instruction tuning, where many tasks can be partially solved using language priors alone.