Prompting language influences diagnostic reasoning and accuracy of large language models

arXiv:2605.19173v1

Large language models (LLMs) are increasingly explored for clinical decision-making, yet most evaluations are conducted in English, leaving their reliability in other languages uncertain. Here we evaluate the impact of prompting language on diagnostic reasoning and final-diagnosis accuracy by comparing English and French performance across five LLMs (o3, DeepSeek-R1, GPT-4-Turbo, Llama-3.1-405B-Instruct, and BioMistral-7B