Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination

arXiv:2506.21546v4 Announce Type: replace-cross

Segmentation Vision-Language Models (VLMs) have significantly advanced grounded visual understanding, yet they remain prone to pixel-grounding hallucinations: they produce masks for the wrong objects or for objects that are entirely absent from the image. Existing evaluations rely almost entirely on text- or label-based perturbations, which check only whether the predicted mask matches the queried label.