GeoAlign: Geometric Feature Realignment for MLLM Spatial Reasoning

arXiv:2604.12630v1 Announce Type: cross

Multimodal large language models (MLLMs) perform remarkably well on a variety of visual tasks, yet still struggle with spatial reasoning. Recent efforts mitigate this by injecting geometric features from 3D foundation models, but they rely on static, single-layer extraction. We identify that such an approach induces a task misalignment bias: the geometric features naturally evolve towards 3D pre