Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models

arXiv:2605.01896v1 Announce Type: new Emerging multi-modal world models attempt to jointly generate videos across diverse modalities (e.g., RGB, depth, and mask), yet they fail to fully exploit the rich priors of existing foundation models. We propose $M^2$-REPA, the first representation alignment method tailored for multi-modal video generation.