Unified Multimodal Models as Auto-Encoders

arXiv:2509.09666v5 Announce Type: replace Image-to-text (I2T) understanding and text-to-image (T2I) generation are two fundamental yet traditionally isolated multimodal tasks. Despite their intrinsic connection, existing approaches typically optimize them independently, missing the opportunity for mutual enhancement.