Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models

arXiv:2508.04325v2 Announce Type: replace-cross Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However, concerns persist regarding the reliability of these benchmarks, which often lack clinical fidelity, robust data management, and safety-oriented evaluation metrics. To address these shortcomings, we