VideoVerse: Does Your T2V Generator Have World Model Capability to Synthesize Videos?

arXiv:2510.08398v3

The recent rapid advancement of Text-to-Video (T2V) generation technologies is endowing trained models with world model capabilities, making existing benchmarks increasingly insufficient for evaluating state-of-the-art T2V models. First, current evaluation dimensions, such as per-frame aesthetic quality and temporal consistency, can no longer differentiate state-of-the-art T2V models. Second, event-level temporal causality, an essential property that differentiates videos from other modalities, remains largely unexplored.