ExpGest: Expressive Speaker Generation Using Diffusion Model and Hybrid Audio-Text Guidance

arXiv:2410.09396v2 Announce Type: replace-cross Existing gesture generation methods primarily focus on upper body gestures based on audio features, neglecting speech content, emotion, and locomotion. These limitations result in stiff, mechanical gestures that fail to convey the true meaning of audio content. We