Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model

arXiv:2605.14950v1

Vision-Language-Action (VLA) models have emerged as a promising paradigm for robotic manipulation by unifying perception, language grounding, and action generation. However, they often struggle in scenarios requiring precise spatial understanding, as current VLA models primarily rely on 2D visual representations that lack depth information and detailed spatial relationships.