Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation

arXiv:2603.17889v1 Announce Type: new Recent advances have demonstrated compelling capabilities in synthesizing real individuals into generated videos, reflecting the growing demand for identity-aware content creation. Nevertheless, an openly accessible framework enabling fine-grained control over facial appearance and voice timbre across multiple identities remains unavailable. In this work, we present a unified and scalable framework for identity-aware joint audio-video generation, enabling high-fidelity and consistent personalization. Specifically, we