Interpretable Perception and Reasoning for Audiovisual Geolocation

arXiv:2603.05708v1 Announce Type: new

While recent advances in Multimodal Large Language Models (MLLMs) have improved image-based localization, precise global geolocation remains a formidable challenge due to the inherent ambiguity of visual landscapes and the largely untapped potential of auditory cues. In this paper, we