MPerS: Dynamic MLLM MixExperts Perception-Guided Remote Sensing Scene Segmentation

arXiv:2605.10769v1 Announce Type: cross The multimodal fusion of images and scene captions has been extensively explored and applied in various fields. For complex remote sensing (RS) scenes, however, existing studies have concentrated predominantly on architectural optimizations for integrating textual semantic information with visual features, while largely neglecting two questions: how to generate high-quality RS captions, and how effective those captions are in multimodal semantic fusion.