DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model

arXiv:2603.06090v1 Announce Type: cross Multimodal large language models (MLLMs) have achieved impressive performance across various tasks such as image captioning and visual question answering (VQA); however, they often struggle to accurately interpret the depth information inherent in visual data. In this work, we