Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs

arXiv:2605.19528v1 Announce Type: new

3D localization in Multimodal Large Language Models (MLLMs), including 3D object detection and 3D visual grounding, is fundamentally limited by camera intrinsic ambiguity: the same image admits different 3D scenes under different cameras. Existing MLLMs either ignore camera parameters and overfit to a canonical