The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding

arXiv:2412.08110v3

Vision-Language Models (VLMs) have achieved strong performance on implicit and explicit visual grounding and related tasks. However, these abilities are typically evaluated on simple, single-object phrases. We find that grounding performance degrades for complex, multi-object references. These limitations largely arise from