JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

arXiv:2506.23552v2 Announce Type: replace The intrinsic link between facial motion and speech is often overlooked in generative modeling, where talking head synthesis and text-to-speech (TTS) are typically addressed as separate tasks. This paper