SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning

ArXi:2605.17949v1 Announce Type: new Remote sensing vision-language models commonly rely on pretrained visual encoders to convert images into semantic features before language-model reasoning. While effective for scene-level understanding, this pipeline may prematurely compress local visual evidence, making fine-grained spatial reasoning vulnerable to language priors, especially in ultra-high-resolution remote sensing imagery.