Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs

arXiv:2503.02597v3 Announce Type: replace-cross Recent Multimodal Large Language Models (MLLMs) have demonstrated significant progress in perceiving and reasoning over multimodal inquiries, ushering in a new research era for foundation models. However, vision-language misalignment in MLLMs has emerged as a critical challenge, where the textual responses generated by these models are not factually aligned with the given text-image inputs.