Leveraging Latent Visual Reasoning in Silence

arXiv:2605.18641v1 Announce Type: new Latent visual reasoning involves visual evidence directly in multimodal reasoning by inserting continuous latent tokens before textual generation. However, whether these latent tokens are actually needed at inference remains unclear. We show that replacing latent tokens with random noise or removing them completely causes little performance degradation across spatial reasoning benchmarks. Reinforcement learning further diminishes the latent generation behavior after post-