Compositional Video Generation via Inference-Time Guidance

arXiv:2605.14988v1 Announce Type: new Text-to-video diffusion models generate realistic videos, but often fail on prompts requiring fine-grained compositional understanding, such as relations between entities, attributes, actions, and motion directions. We hypothesize that these failures need not be addressed by re