DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

arXiv:2503.14324v3 Announce Type: replace-cross

Visual understanding and visual generation require different representation spaces, which makes it difficult to unify them within the autoregressive paradigm of large language models. A vision tokenizer trained for reconstruction excels at capturing low-level visual appearance, which makes it well suited to visual generation, but it lacks the high-level semantic representations needed for understanding tasks.