HYDRA: Unifying Multi-modal Generation and Understanding via Representation-Harmonized Tokenization

arXiv:2603.15228v1 Announce Type: new Unified Multimodal Models struggle to bridge the fundamental gap between the abstract representations needed for visual understanding and the detailed primitives required for generation. Existing approaches typically compromise by employing decoupled encoders, stacking representation encoders atop VAEs, or utilizing discrete quantization. However, these methods often disrupt information coherence and lead to optimization conflicts. To this end, we