ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks

arXiv:2604.07958v1 Announce Type: new

Current video editing models often rely on expensive paired video data, which limits their practical scalability. In essence, most video editing tasks can be formulated as a decoupled spatiotemporal process, where the temporal dynamics of the pretrained model are preserved while spatial content is selectively and precisely modified. Based on this insight, we propose ImVideoEdit, an efficient framework that learns video editing capabilities entirely from image pairs.