Control-DINO: Feature Space Conditioning for Controllable Image-to-Video Diffusion

arXiv:2604.01761v1

Video models have recently been applied successfully to content generation, novel view synthesis, and, more broadly, world simulation. Many applications in generation and transfer rely on conditioning these models, typically on perceptual, geometric, or simple semantic signals, and so in effect use them as generative renderers. At the same time, high-dimensional features obtained from large-scale self-supervised learning on images or point clouds are increasingly used as a general-purpose interface for vision models.