UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars

arXiv:2605.14731v1 Announce Type: cross Speech-driven gestures and facial animations are fundamental to expressive digital avatars in games, virtual production, and interactive media. However, existing methods fall short in one of two ways. Some are limited to a single modality for audio-motion alignment and therefore fail to exploit the large volume of available human motion data. Others are constrained by the representational capacity and throughput of multimodal models, which makes it difficult to achieve both high-quality motion generation and real-time performance.