Cropping outperforms dropout as an augmentation strategy for self-supervised training of text embeddings

arXiv:2508.03453v2 Announce Type: replace-cross Text embeddings, i.e. vector representations of entire texts, play an important role in many NLP applications, such as retrieval-augmented generation, clustering, and visualizing collections of texts for data exploration. Currently, top-performing embedding models are derived from pre-trained language models via supervised contrastive fine-tuning. This fine-tuning strategy relies on an external notion of similarity and on annotated data to generate positive pairs.