I Came, I Saw, I Explained: Benchmarking Multimodal LLMs on Figurative Meaning in Memes

arXiv:2603.23229v1

Internet memes are a popular form of multimodal online communication and often use figurative elements to convey layered meaning through the combination of text and images. However, it remains largely unclear how multimodal large language models (MLLMs) combine and interpret visual and textual information to identify figurative meaning in memes. To address this gap, we evaluate eight state-of-the-art generative MLLMs across three datasets on their ability to detect and explain six types of figurative meaning.