Efficient Encoder-Free Fourier-based 3D Large Multimodal Model

arXiv:2602.23153v2

Large Multimodal Models (LMMs) that process 3D data typically rely on heavy, pre-trained visual encoders to extract geometric features. While recent 2D LMMs have begun to eliminate such encoders for efficiency and scalability, extending this paradigm to 3D remains challenging due to the unordered and large-scale nature of point clouds.