Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

arXiv:2603.18002v1 Announce Type: new Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We