[P] PCA before truncation makes non-Matryoshka embeddings compressible: results on BGE-M3
r/MachineLearning
Most embedding models are not Matryoshka-trained, so naive dimension truncation tends to destroy them. I tested a simple alternative: fit PCA once on a sample of embeddings, rotate vectors into the PCA basis, and then truncate. The idea is that PCA concentrates signal into leading components, so truncation stops being arbitrary.
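
For anyone who wants to try the idea, here is a minimal sketch of fit-once, rotate, then truncate in plain NumPy. It is not the exact pipeline behind the results in this post. The function names `fit_pca` and `compress`, mean-centering before the rotation, re-normalizing after truncation (for cosine similarity), and the synthetic data standing in for real BGE-M3 vectors are all assumptions made for illustration.

```python
import numpy as np

def fit_pca(sample):
    """Fit PCA on a sample of embeddings (hypothetical helper).

    Returns the sample mean and the principal directions as rows,
    ordered by decreasing explained variance.
    """
    mean = sample.mean(axis=0)
    # SVD of the centered sample: rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(sample - mean, full_matrices=False)
    return mean, vt

def compress(vectors, mean, components, k):
    """Rotate into the PCA basis, keep the first k coordinates, re-normalize."""
    # Projecting onto the first k directions equals rotating, then truncating.
    rotated = (vectors - mean) @ components[:k].T
    # Assumption: unit-length output so dot product equals cosine similarity.
    norms = np.linalg.norm(rotated, axis=1, keepdims=True)
    return rotated / np.clip(norms, 1e-12, None)

# Synthetic stand-in for real embeddings: 64-dim vectors whose signal
# lives in a 16-dim subspace spread across all coordinates.
rng = np.random.default_rng(0)
latent = rng.normal(size=(2000, 16))
mixing = rng.normal(size=(16, 64))
embeddings = latent @ mixing + 0.01 * rng.normal(size=(2000, 64))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Fit once on a sample, then reuse the same basis for every other vector.
mean, components = fit_pca(embeddings[:1000])
compressed = compress(embeddings[1000:], mean, components, k=16)

# Naive baseline for comparison: keep the first 16 raw coordinates.
naive = embeddings[1000:, :16]
```

The `mean` and `components` arrays are fit once and stored; queries and documents must go through the same `compress` call with the same `k`, otherwise their coordinates are not comparable.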