UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation

arXiv:2603.22282v1 Announce Type: cross We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which