DoorDash Builds LLM Conversation Simulator to Test Customer Support Chatbots at Scale

DoorDash engineers built a simulation and evaluation flywheel to test large language model (LLM) customer support chatbots at scale. The system generates multi-turn synthetic conversations from historical transcripts and backend mocks, evaluates outcomes with an LLM-as-judge framework, and lets engineers iterate rapidly on prompts, context, and system design before production deployment.

By Leela Kumili
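
To make the simulate-then-judge loop concrete, here is a minimal sketch in Python. It is not DoorDash's implementation, which is not detailed here. Every name in it (`Scenario`, `simulate`, `judge`, `toy_bot`, `run_suite`) is hypothetical, and two simplifications are assumed: the customer side replays a fixed script where the described system would use an LLM seeded from historical transcripts, and the judge is a keyword check standing in for an LLM-as-judge call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Turn:
    role: str  # "customer" or "bot"
    text: str


@dataclass
class Scenario:
    """One synthetic test case (hypothetical structure)."""
    name: str
    customer_script: List[str]  # stand-in for an LLM-driven customer simulator
    backend: Dict[str, str]     # mocked backend state, e.g. order status
    expected_phrase: str        # what the stub judge looks for in bot replies


Bot = Callable[[List[Turn], Dict[str, str]], str]


def simulate(scenario: Scenario, bot: Bot) -> List[Turn]:
    """Run a multi-turn conversation between the scripted customer and the bot."""
    history: List[Turn] = []
    for message in scenario.customer_script:
        history.append(Turn("customer", message))
        history.append(Turn("bot", bot(history, scenario.backend)))
    return history


def judge(scenario: Scenario, history: List[Turn]) -> bool:
    """Stub judge: a real system would ask an LLM to grade the transcript."""
    return any(
        turn.role == "bot" and scenario.expected_phrase in turn.text.lower()
        for turn in history
    )


def toy_bot(history: List[Turn], backend: Dict[str, str]) -> str:
    """Trivial rule-based bot so the sketch runs without any model."""
    last = history[-1].text.lower()
    if "where" in last:
        return f"Your order status is: {backend['order_status']}."
    if "refund" in last:
        return "I have issued a refund for the order."
    return "Could you tell me more about the issue?"


def run_suite(scenarios: List[Scenario], bot: Bot) -> Dict[str, bool]:
    """Simulate every scenario and record whether the judge passed it."""
    return {s.name: judge(s, simulate(s, bot)) for s in scenarios}


SCENARIOS = [
    Scenario("late_order", ["Hi", "Where is my order?"],
             {"order_status": "delayed"}, "delayed"),
    Scenario("missing_item", ["An item is missing", "I want a refund"],
             {"order_status": "delivered"}, "refund"),
]

if __name__ == "__main__":
    results = run_suite(SCENARIOS, toy_bot)
    print(results, sum(results.values()) / len(results))
```

Because the bot is passed in as a plain callable, a changed prompt or context strategy can be evaluated by running the same scenario suite against the new bot and comparing pass rates, which is the iteration loop the summary describes.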