Reducing AI agent token consumption by 90% by fixing the retrieval layer

r/artificial
Machine Learning Generative AI

Quick insight from building retrieval infrastructure for AI agents: Most agents stuff 50,000 tokens of context into every prompt. They retrieve 200 documents by cosine similarity, hope the right answer is somewhere in there, and let the LLM figure it out. When it doesn't, and it often doesn't, the agent re-retrieves. Every retry burns tokens and money. We built a retrieval engine called Shaped that gives agents 10 ranked results instead of 200. The results are scored by ML models trained on actual interaction data, not just embedding similarity.