[P] PCA before truncation makes non-Matryoshka embeddings compressible: results on BGE-M3

r/MachineLearning

Most embedding models are not Matryoshka-trained, so naive truncation (keeping only the first k dimensions) tends to wreck embedding quality. I tested a simple alternative: fit PCA once on a sample of embeddings, rotate vectors into the PCA basis, and then truncate. The idea is that PCA concentrates variance into the leading components, so the dimensions you drop are the ones carrying the least signal rather than an arbitrary slice.
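
For concreteness, here is a minimal sketch of the fit, rotate, truncate steps. It is not the exact code behind the results: it assumes NumPy, uses synthetic stand-in data instead of real BGE-M3 embeddings, and the helper names `fit_pca` and `rotate_and_truncate` are made up for illustration. Re-normalizing after truncation is also an assumption on my part (it is convenient if you compare vectors by cosine similarity).

```python
import numpy as np


def fit_pca(sample: np.ndarray):
    """Fit PCA on a sample of embeddings (hypothetical helper).

    Returns the sample mean and the principal directions as rows,
    ordered from highest to lowest explained variance.
    """
    mean = sample.mean(axis=0)
    centered = sample - mean
    # SVD of the centered data: rows of vt are the principal directions,
    # already sorted by descending singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt


def rotate_and_truncate(x: np.ndarray, mean: np.ndarray,
                        components: np.ndarray, k: int) -> np.ndarray:
    """Project onto the top-k principal directions, then L2-normalize."""
    projected = (x - mean) @ components[:k].T
    # Re-normalization is an assumption here, not something the post specifies.
    norms = np.linalg.norm(projected, axis=1, keepdims=True)
    return projected / np.clip(norms, 1e-12, None)


# Synthetic stand-in data: variance decays along a hidden rotated basis,
# so the raw coordinates do not reveal which directions matter.
rng = np.random.default_rng(0)
dim, k = 64, 16
scales = np.linspace(3.0, 0.1, dim)
basis, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
sample = (rng.normal(size=(2000, dim)) * scales) @ basis

mean, components = fit_pca(sample)  # fit once, reuse for every vector
reduced = rotate_and_truncate(sample[:5], mean, components, k)
print(reduced.shape)  # (5, 16)
```

The PCA fit happens once; after that, compressing a new vector is one mean subtraction and one matrix multiply. Queries and documents need to go through the same `mean` and `components`, otherwise they end up in different bases.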