LAGO: Language-Guided Adaptive Object-Region Focus for Zero-Shot Visual-Text Alignment

arXiv:2605.08156v1 Announce Type: cross Zero-shot recognition aims to classify an image by selecting the most compatible label description from a set of candidate classes without any task-specific supervision. In fine-grained settings, however, the relevant evidence often lies in localized parts, attributes, or textures rather than in the full image, making whole-image alignment suboptimal.
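
To make the whole-image baseline described above concrete, the following is a minimal sketch of zero-shot classification by embedding similarity. It is not the LAGO method. It assumes that an image and each candidate label description have already been mapped into a shared embedding space by some pretrained vision-language encoder. The function names (`cosine`, `zero_shot_classify`), the label prompts, and the three-dimensional toy vectors are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label whose description embedding best matches the image.

    No task-specific training takes place: the prediction is simply the
    highest-scoring candidate description.
    """
    scores = {
        label: cosine(image_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical toy embeddings; a real system would obtain these from a
# pretrained image encoder and text encoder.
label_embeddings = {
    "a photo of a sparrow": [0.9, 0.1, 0.2],
    "a photo of a finch": [0.2, 0.9, 0.1],
}
whole_image_embedding = [0.8, 0.3, 0.1]

predicted, scores = zero_shot_classify(whole_image_embedding, label_embeddings)
print(predicted)
```

In this sketch the image is represented by a single whole-image embedding. In fine-grained settings a discriminative part or texture may contribute only weakly to that one vector, which is the limitation the abstract points to.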