Retrieving Counterfactuals Improves Visual In-Context Learning

arXiv:2603.16737v1 Announce Type: cross Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal reasoning tasks, but they often struggle to disentangle fine-grained visual attributes and reason about underlying causal relationships. In-context learning (ICL) offers a promising avenue for VLMs to adapt to new tasks, but its effectiveness critically depends on the selection of demonstration examples.