End-to-End Chatbot Evaluation with Adaptive Reasoning and Uncertainty Filtering

arXiv:2603.10570v1

Large language models (LLMs) combined with retrieval-augmented generation have enabled the deployment of domain-specific chatbots, but these systems remain prone to generating unsupported or incorrect answers. Reliable evaluation is therefore critical, yet manual review is costly, and existing frameworks often depend on curated test sets and static metrics, which limits their scalability. We propose an end-to-end automatic evaluator designed to substantially reduce human effort.