Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models

arXiv:2605.18601v1

Modern interactive video world models have achieved impressive visual fidelity, yet they lack fine-grained multi-entity control and generalize poorly across entities and across worlds. We trace this gap to the action interface: standard control protocols (e.g., animation IDs, device inputs, scene-level captions) bind action semantics to specific entities or engines at design time.