Been building a RAG system over a codebase and hit a wall I can't seem to get past

Every time I change something like chunk size, embedding model or retrieval top-k, I have no reliable way to tell if it actually got better or worse. I end up just manually testing a few queries and going with my gut.

Curious how others handle this:

- Do you have evals set up? If so, how did you build them?
- Do you track retrieval quality separately from generation quality?
- How do you know when a chunk is the problem vs the prompt vs the model?

Thanks in advance!

submitted by /u/LeaderUpset4726