Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT

arXiv:2603.09715v1 Announce Type: new Visual instruction tuning is crucial for improving vision-language large models (VLLMs). However, many samples can be solved via linguistic patterns or common-sense shortcuts, without genuine cross-modal reasoning, which limits the effectiveness of multimodal learning. Prior data selection methods often rely on costly proxy model