Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

ArXi:2604.24763v1 Announce Type: new Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation, creating misalignment between the two tasks and preventing fully end-to-end optimization from raw pixels. We