MATRIX: Mask Track Alignment for Interaction-aware Video Generation

arXiv:2510.07310v2 Announce Type: replace Video DiTs have advanced video generation, yet they still struggle to model multi-instance or subject-object interactions. This raises a key question: how do these models internally represent interactions? To answer this, we curate MATRIX-11K, a video dataset with interaction-aware captions and multi-instance mask tracks.