What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging

arXiv:2510.13232v2 Announce Type: replace-cross State-of-the-art vision-language models (VLMs) suffer from a critical failure in understanding negation, often referred to as affirmative bias. This limitation is particularly severe in described object detection (DOD) tasks. To address this, we propose two primary contributions: (1) a new dataset pipeline and (2) a novel, lightweight adaptation recipe. First, we