Evaluated a RAG chatbot and the most expensive model was the worst performer. Notes on what actually moved the needle.

We had a customer RAG bot. Standard setup: ChromaDB, a system prompt, an LLM doing generation. Nobody had actually measured response quality. All I had in the name of evaluation was a keyword-matching script producing numbers that looked like scores and meant nothing. I went in to fix this properly. Sharing what I found because most of it was not where I expected.

1. Retrieval problems disguise themselves as LLM problems.