Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction

arXiv:2510.12834v3 Announce Type: replace-cross Human communication is multimodal, with speech and gestures tightly coupled, yet most computational methods for generating speech and gestures synthesize them sequentially, weakening synchrony and prosody alignment. We