From Reasoning to Pixels: Benchmarking the Alignment Gap in Unified Multimodal Models

arXiv:2602.08336v2 Announce Type: replace Unified multimodal models (UMMs) aim to integrate multimodal understanding and generation within a unified architecture, yet it remains unclear to what extent their representations are truly aligned across modalities. To investigate this question, we use reasoning-guided image generation as a diagnostic task, where models produce textual reasoning first and then generate images. We