Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

arXiv:2512.12623v3 Announce Type: replace-cross Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced cross-modal understanding and reasoning by incorporating Chain-of-Thought (CoT) reasoning in the semantic space. Building upon this, recent studies extend the CoT mechanism to the visual modality, enabling models to integrate visual information during reasoning through external tools or explicit image generation.