MMGT: Motion Mask Guided Two-Stage Network for Co-Speech Gesture Video Generation

ArXi:2505.23120v2 Announce Type: replace Co-Speech Gesture Video Generation aims to generate vivid speech videos from audio-driven still images, which is challenging due to the diversity of body parts in terms of motion amplitude, audio relevance, and detailed features. Relying solely on audio as the control signal often fails to capture large gesture movements in videos, resulting in noticeable artifacts and distortions. Existing approaches typically address this issue by adding extra prior inputs, but this can limit the practical application of the task.