When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation

ArXi:2603.00314v2 Announce Type: replace-cross As Large Language Models (LLMs) are increasingly integrated into healthcare to address complex inquiries, ensuring their reliability remains a critical challenge. Recent studies have highlighted that generic LLMs often struggle in clinical contexts, occasionally producing misleading guidance. To mitigate these risks, this research focuses on the domain-specific adaptation of \textbf{Llama-2-7B} using the \textbf{Low-Rank Adaptation (LoRA)} technique.