FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings

arXiv:2604.18109v1 Announce Type: new This paper presents factorized linear projection (FLiP) models for understanding pretrained sentence embedding spaces. We train FLiP models to recover the lexical content from multilingual (LaBSE), multimodal (SONAR), and API-based (Gemini) sentence embedding spaces in several high- and mid-resource languages. We show that FLiP can recall more than 75% of lexical content from the embeddings, significantly outperforming existing non-factorized baselines.
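
The abstract does not spell out the FLiP architecture, so the following is only a rough sketch of one plausible reading of "factorized linear projection": a sentence embedding is mapped to per-word vocabulary scores through two small matrices (embedding to a low-rank space, then low-rank space to vocabulary) rather than one large matrix, and lexical recall is measured as the fraction of a sentence's true words found among the top-k scored words. All names here (`word_scores`, `lexical_recall`, `down`, `up`, `k`) are hypothetical and not taken from the paper, and the toy weights are hand-picked rather than trained.

```python
def matvec(vec, mat):
    """Multiply a row vector by a matrix stored as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]


def word_scores(embedding, down, up):
    """Score every vocabulary word via two linear maps (assumed factorization).

    down: embed_dim x rank matrix, up: rank x vocab_size matrix.
    """
    hidden = matvec(embedding, down)  # project into the low-rank space
    return matvec(hidden, up)         # project onto the vocabulary


def lexical_recall(scores, true_word_ids, k):
    """Fraction of the sentence's true words among the k highest-scored words."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = len(set(ranked[:k]) & set(true_word_ids))
    return hits / len(true_word_ids)


# Toy example: embed_dim=3, rank=2, vocab_size=4 (hand-picked weights).
down = [[1, 0], [0, 1], [0, 0]]
up = [[1, 0, 0, 2], [0, 3, 0, 0]]
scores = word_scores([1, 1, 5], down, up)
recall = lexical_recall(scores, true_word_ids=[1, 2], k=2)
```

Under this assumed factorization the parameter count is rank × (embed_dim + vocab_size) instead of embed_dim × vocab_size for a single projection matrix, which is one possible motivation for factorizing when the vocabulary is large.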