VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval

arXiv:2505.20291v4 Announce Type: replace-cross Text-to-image retrieval (T2I retrieval) remains challenging because cross-modal embeddings often behave as bags of concepts, underrepresenting structured visual relationships such as pose and viewpoint. We propose Visualize-then-Retrieve (VisRet), a retrieval paradigm that mitigates this limitation of cross-modal similarity alignment.