Hierarchical Collaborative Fusion for 3D Instance-aware Referring Expression Segmentation

arXiv:2603.06250v1 Announce Type: new Generalized 3D Referring Expression Segmentation (3D-GRES) localizes objects in 3D scenes from natural language descriptions, even when a description matches multiple targets or none. Existing methods rely solely on sparse point clouds, which lack the rich visual semantics needed to resolve fine-grained descriptions. We propose HCF-RES, a multi-modal framework with two key innovations.