GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning

arXiv:2603.13370v1 Announce Type: cross Vision-Language Models (VLMs) have demonstrated remarkable capabilities in aligning and understanding multimodal signals, yet their potential to reason over structured data, where multimodal entities are connected through explicit relational graphs, remains largely underexplored. Unlocking this capability is crucial for real-world applications such as social networks, recommendation systems, and scientific discovery, where multimodal information is inherently structured.