Vision-aligned Latent Reasoning for Multi-modal Large Language Model

arXiv:2602.04476v2 Announce Type: replace Despite recent advancements in Multi-modal Large Language Models (MLLMs) on diverse understanding tasks, these models struggle to solve problems that require extensive multi-step reasoning. This is primarily due to the progressive dilution of visual information during long-context generation, which hinders their ability to fully exploit test-time scaling. To address this issue, we