Multimodal Diffusion Transformer with Memory Bank for Scalable Long-Duration Talking Video Generation

arXiv:2411.16748v5

Long-duration talking video synthesis faces enduring challenges in achieving high video quality, portrait consistency, temporal coherence, and computational efficiency. As video length increases, visual degradation, portrait drift, temporal artifacts, and error accumulation become more pronounced, severely reducing the realism and reliability of the results.