EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA

arXiv:2510.06371v2 Announce Type: replace-cross Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they are often limited when queries require cultural and visual information and everyday knowledge, particularly in low-resource and underrepresented languages. We