Multi-layer Cross-Attention is Provably Optimal for Multi-modal In-context Learning

arXiv:2602.04872v2 Announce Type: replace-cross Recent progress has rapidly advanced our understanding of the mechanisms underlying in-context learning in modern attention-based neural networks. However, existing results focus exclusively on unimodal data; in contrast, the theoretical underpinnings of in-context learning for multi-modal data remain poorly understood. We