Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models

arXiv:2603.06043v1 Announce Type: new Recently, unified multimodal models (UMMs) have made remarkable progress in integrating visual understanding and generation, demonstrating strong potential for complex text-to-image (T2I) tasks. Despite their theoretical promise, a persistent capability gap exists: UMMs typically exhibit superior visual understanding but comparatively weaker generative capabilities. This discrepancy arises largely from the intrinsic decoupling between the understanding and generation processes.