S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

arXiv:2604.18512v1 Announce Type: new Vision-Language Models (VLMs) have demonstrated remarkable progress in single-image understanding, yet effective reasoning across multiple images remains challenging. We identify a critical capability gap in existing multi-image alignment approaches: current methods focus primarily on localized reasoning with pre-specified image indices (``Look at Image 3 and ...''), bypassing the essential skills of global visual search and autonomous cross-image comparison. To address this limitation, we ...