SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls

arXiv:2602.23956v2

Recent advances in text-to-video diffusion models have enabled high-fidelity, temporally coherent video synthesis. However, current models are predominantly optimized for single-event generation. Given a multi-event prompt and no explicit temporal grounding, such models often produce blended or collapsed scenes that break the intended narrative. To address this limitation, we present SwitchCraft, a