SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models

arXiv:2605.13667v1 Announce Type: new Scene graph generation provides a compact, structured representation for visual perception, but accurate and fast graph prediction from images and videos remains challenging. Recent VLM-based methods can generate scene graphs end-to-end as structured text, yet they often produce long outputs containing irrelevant objects and relations. We present SceneGraphVLM, a compact method for image and video scene graph generation with small vision-language models.