AI RESEARCH

Gesture2Speech: How Far Can Hand Movements Shape Expressive Speech?

arXiv CS.AI

arXiv:2603.19831v1 Announce Type: cross Human communication seamlessly integrates speech and bodily motion, where hand gestures naturally complement vocal prosody to express intent, emotion, and emphasis. While recent text-to-speech (TTS) systems have begun incorporating multimodal cues such as facial expressions or lip movements, the role of hand gestures in shaping prosody remains largely underexplored. We propose a novel multimodal TTS framework, Gesture2Speech, that leverages visual gesture cues to modulate prosody in synthesized speech.