PLAF: Pixel-wise Language-Aligned Feature Extraction for Efficient 3D Scene Understanding

arXiv:2604.15770v1

Accurate open-vocabulary 3D scene understanding requires semantic representations that are both language-aligned and spatially precise at the pixel level, while remaining scalable when lifted to 3D space. However, existing representations struggle to satisfy these requirements jointly, and densely propagating pixel-wise semantics to 3D often results in substantial redundancy, leading to inefficient storage and querying in large-scale scenes.