EvoTok: A Unified Image Tokenizer via Residual Latent Evolution for Visual Understanding and Generation

ArXi:2603.12108v1 Announce Type: new The development of unified multimodal large language models (MLLMs) is fundamentally challenged by the granularity gap between visual understanding and generation: understanding requires high-level semantic abstractions, while image generation demands fine-grained pixel-level representations. Existing approaches usually enforce the two supervision on the same set of representation or decouple these two supervision on separate feature spaces, leading to interference and inconsistency, respectively.