SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

arXiv:2605.12500v1 Announce Type: new Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we