Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement

arXiv:2411.15115v3

Recent text-to-video (T2V) diffusion models have made remarkable progress in generating high-quality videos. However, they often struggle to align with complex text prompts, particularly when multiple objects, attributes, or spatial relations are specified. We