SeGPruner: Semantic-Geometric Visual Token Pruner for 3D Question Answering

arXiv:2603.29437v1 Announce Type: new Vision-language models (VLMs) have been widely adopted for 3D question answering (3D QA). In typical pipelines, visual tokens extracted from multiple viewpoints are concatenated with language tokens and jointly processed by a large language model (LLM) for inference. However, aggregating multi-view observations inevitably