AI RESEARCH

WHBench: Evaluating Frontier LLMs with Expert-in-the-Loop Validation on Women's Health Topics

arXiv CS.AI

ArXi:2604.00024v1 Announce Type: cross Large language models are increasingly used for medical guidance, but women's health remains under-evaluated in benchmark design. We present the Women's Health Benchmark (WHBench), a targeted evaluation suite of 47 expert-crafted scenarios across 10 women's health topics, designed to expose clinically meaningful failure modes including outdated guidelines, unsafe omissions, dosing errors, and equity-related blind spots.