Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations

arXiv:2605.12145v1 Announce Type: new Multimodal learning seeks to integrate information across diverse sensory sources, yet current approaches struggle to balance cross-modal generalizability with modality-specific structure. Continuous (implicit) methods preserve fine-grained priors but make generalization difficult, while discrete (explicit) approaches enforce shared prototypes at the expense of modality specificity. We