ASK: Adaptive Self-improving Knowledge Framework for Audio Text Retrieval

arXiv:2512.19703v2

The dominant paradigm for Audio-Text Retrieval (ATR) relies on dual-encoder architectures optimized via mini-batch contrastive learning. However, restricting optimization to local in-batch samples creates a fundamental limitation that we term the Gradient Locality Bottleneck (GLB), which prevents the resolution of acoustic ambiguities and hinders the learning of rare long-tail concepts.
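
To make the in-batch restriction concrete, the following is a minimal sketch of the symmetric InfoNCE-style loss commonly used to train dual encoders. It is not the ASK method itself, and the function names, the plain-list embedding format, and the temperature value of 0.07 are all illustrative assumptions rather than details from the paper. The point it illustrates is that every negative in the loss comes from the same mini-batch, so gradients only ever reflect comparisons among those few samples.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum_exp for x in row]


def in_batch_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over one mini-batch (hypothetical helper).

    audio_emb[i] and text_emb[i] are assumed to be a matched pair; every
    other item in the batch acts as a negative. Nothing outside the batch
    contributes to the loss, which is the locality the abstract refers to.
    """
    audio = [l2_normalize(v) for v in audio_emb]
    text = [l2_normalize(v) for v in text_emb]
    n = len(audio)

    # Temperature-scaled cosine similarity between every audio/text pair.
    sim = [
        [sum(a * t for a, t in zip(audio[i], text[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Audio-to-text: each row should put its mass on the diagonal entry.
    audio_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n

    # Text-to-audio: same objective on the transposed similarity matrix.
    sim_t = [list(col) for col in zip(*sim)]
    text_to_audio = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n

    return 0.5 * (audio_to_text + text_to_audio)
```

In this sketch, a rare concept or an acoustically ambiguous clip only receives a useful training signal when a relevant contrasting sample happens to fall in the same batch.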