Semantic-Enriched Latent Visual Reasoning

arXiv:2605.19342v1 Announce Type: new Multimodal latent-space reasoning aims to replace explicit "thinking with images" by performing visual reasoning directly in a compact latent space. However, existing approaches rely largely on visual supervision and produce latent representations that lack semantic richness, limiting their ability to handle diverse region-level reasoning tasks. In this work, we