Lightweight Prompt-Guided CLIP Adaptation for Monocular Depth Estimation

arXiv:2604.01118v1 Announce Type: cross Leveraging the rich semantic features of vision-language models (VLMs) such as CLIP for monocular depth estimation is a promising direction, yet existing approaches often require extensive fine-tuning or lack geometric precision. We present MoA-DepthCLIP, a parameter-efficient framework that adapts pretrained CLIP representations for monocular depth estimation with minimal supervision.