Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation

arXiv:2604.17656v1 Announce Type: cross Video-to-music (V2M) generation is the task of creating background music for an input video. Recent V2M models typically achieve audiovisual alignment by relying on visual conditioning alone, and they give the end user limited semantic and stylistic control. In this paper, we present Video-Robin, a novel text-conditioned video-to-music generation model that enables fast, high-quality, semantically aligned music generation for video content.