Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs

arXiv:2604.18203v1 Announce Type: new Multimodal LLMs can accurately perceive numerical content across modalities, yet they fail to perform exact multi-digit multiplication when the identical underlying arithmetic problem is presented as numerals, number words, images, or audio. Because existing benchmarks often lack systematically paired instances across modalities, it remains difficult to compare genuine arithmetic limits within and across model families. We therefore