Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems

arXiv:2601.15161v2 Announce Type: replace-cross Large Language Models (LLMs) are increasingly used for clinical decision support, where hallucinations and unsafe suggestions may pose direct risks to patient safety. These risks are hard to assess: subtle clinical errors are often missed by generic metrics and by LLM judges that apply general criteria, while expert-authored fine-grained rubrics are expensive and difficult to scale. In this paper, we propose a retrieval-augmented multi-agent framework that automates the generation of instance-specific evaluation rubrics.