Investigating Anisotropy in Visual Grounding under Controlled Counterfactual Perturbations

arXiv:2605.09090v1 Announce Type: cross

Visual Grounding benchmarks assume that the object described by a referring expression is always present in the image, and grounding models are therefore rarely evaluated under semantically mismatched captions. In such cases, models frequently exhibit approximation behavior, producing a plausible bounding box that satisfies only part of the expression (e.g., preserving the original object while ignoring modified contextual cues