When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

arXiv:2604.08546v1 Announce Type: new Text-to-video diffusion models have enabled open-ended video synthesis, but they often struggle to generate the correct number of objects specified in a prompt. We