A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning

arXiv:2604.03995v1 Announce Type: new

As audio-visual multi-modal large language models (MLLMs) are increasingly deployed in safety-critical applications, understanding their vulnerabilities is crucial. To this end, we