Unified Text-Image-to-Video Generation: A Training-Free Approach to Flexible Visual Conditioning

arXiv:2505.20629v3 Announce Type: replace-cross Text-image-to-video (TI2V) generation is a critical problem for controllable video generation using both semantic and visual conditions. Most existing methods add visual conditions to text-to-video (T2V) foundation models by finetuning, which is costly in resources and limited to a few pre-defined conditioning settings. To tackle these constraints, we