AI RESEARCH
Improving Joint Audio-Video Generation with Cross-Modal Context Learning
arXiv cs.CV
arXiv:2603.18600v1 Announce Type: new Joint audio-video generation built on a dual-stream transformer architecture has become the dominant paradigm in current research. By combining pre-trained video and audio diffusion models with a cross-modal interaction attention module, these methods can generate high-quality, temporally synchronized audio-video content with minimal