Enhanced Text-to-Image Generation by Fine-grained Multimodal Reasoning

arXiv:2604.13491v1

With the rapid progress of Multimodal Large Language Models (MLLMs), unified MLLMs that jointly perform image understanding and generation have advanced significantly. However, although unified MLLMs have inherent reasoning capabilities that support self-reflection and self-refinement, the use of these capabilities in text-to-image generation remains largely underexplored.