Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs

arXiv:2602.23136v2 Announce Type: replace-cross Numerous studies have shown that multimodal LLMs process speech and images well but fail in non-intuitive ways, rendering trivial tasks such as object counting unreliable.