Geometry-Guided 3D Visual Token Pruning for Video-Language Models

arXiv:2604.18260v1 Announce Type: new Multimodal large language models have demonstrated remarkable capabilities in 2D vision, motivating their extension to 3D scene understanding. Recent studies represent 3D scenes as 3D spatial videos composed of image sequences with depth and camera pose information, enabling pre-trained video-language models to perform 3D reasoning tasks. However, the large number of visual tokens in spatial videos remains a major bottleneck for efficient inference and context management.