ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation

arXiv:2603.11421v1 Announce Type: new Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant bottleneck. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in current models. To overcome this bottleneck, we propose a data-centric paradigm shift, positing that aligned (Caption, Trajectory, Video) triplets form an inherent joint distribution that can connect automated plotting and precise execution.