ViBe: Ultra-High-Resolution Video Synthesis Born from Pure Images

arXiv:2603.23326v1 Announce Type: new Transformer-based video diffusion models rely on 3D attention over spatial and temporal tokens, which incurs quadratic time and memory complexity and makes end-to-end