Getting to the Point: Why Pointing Improves LVLMs

arXiv:2603.21746v1 Pointing increases the accuracy and explainability of Large Vision-Language Models (LVLMs) by modeling grounding and reasoning as explicit sequential steps. The model first grounds the objects mentioned in the natural-language query by predicting their coordinates, and then generates an answer conditioned on these points. While pointing has been shown to increase LVLMs' accuracy, it is unclear which mechanisms drive these gains and how relevant pointing is in cognitive tasks.
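
The two-step structure described above (ground first, then answer conditioned on the points) can be illustrated with a toy sketch. This is not the paper's model: a symbolic scene stands in for the image, a counting question is assumed as the example task, and the names `SceneObject`, `ground`, `answer_count`, and `point_then_answer` are hypothetical. In an actual LVLM both steps would be produced by the model itself as generated output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical stand-in for an object visible in an image."""
    label: str
    x: float  # normalized horizontal coordinate in [0, 1]
    y: float  # normalized vertical coordinate in [0, 1]


def ground(scene, target_label):
    """Step 1 (grounding): return one (x, y) point per object matching the query."""
    return [(obj.x, obj.y) for obj in scene if obj.label == target_label]


def answer_count(points):
    """Step 2 (reasoning): derive the answer from the predicted points only."""
    return len(points)


def point_then_answer(scene, target_label):
    """Run grounding and answering as explicit sequential steps."""
    points = ground(scene, target_label)
    return points, answer_count(points)


# Assumed example query: "How many apples are there?"
scene = [
    SceneObject("apple", 0.2, 0.3),
    SceneObject("pear", 0.5, 0.5),
    SceneObject("apple", 0.8, 0.1),
]
points, count = point_then_answer(scene, "apple")
```

Because the answer is computed from `points` alone, the intermediate coordinates can be inspected to see which objects the answer relied on, which is the sense in which pointing aids explainability.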