Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks

arXiv:2509.22258v5 Announce Type: replace-cross Recent vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We