JoyStreamer: Unlocking Highly Expressive Avatars via Harmonized Text-Audio Conditioning

arXiv:2602.00702v2 Announce Type: replace Existing video avatar models have demonstrated impressive capabilities in scenarios such as talking, public speaking, and singing. However, most of these methods align only weakly with text instructions, particularly when prompts involve complex elements such as large full-body movements, dynamic camera trajectories, background transitions, or human-object interactions. To overcome this limitation, we present JoyAvatar, a framework capable of generating long-duration avatar videos, featuring two key technical innovations.