Beyond Referring Expressions: Scenario Comprehension Visual Grounding

arXiv:2604.02323v1

Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and challenging setting, scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We