Safety and accuracy follow different scaling laws in clinical large language models

arXiv:2605.04039v1 Announce Type: cross Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We