Automated Evaluation Can Distinguish the Good and Bad AI Responses to Patient Questions about Hospitalization

arXiv:2510.00436v2 Announce Type: replace Automated approaches to answering patient-posed health questions are on the rise, but selecting among systems requires reliable evaluation. The current gold standard for evaluating free-text artificial intelligence (AI) responses, human expert review, is labor-intensive and slow, which limits scalability. Automated metrics are promising, yet they align variably with human judgments and are often context-dependent.