I Tried Vector Search on Molecules — Here’s What Happened

Towards AI

Photo by author

How I built a molecular similarity search system using ChemBERTa, RDKit, and a vector database, and what I learned along the way.

Why I Wanted to Do This

I have been spending a lot of time lately experimenting with vector databases and embedding-based search. Most examples I came across focused on text: semantic document search, FAQ retrieval systems, or chatbot memory. At some point, I started thinking: could the same idea work for molecules?

Around that time, I had also been reading about ChemBERTa, a transformer model trained on SMILES strings from the ZINC database.