Counterfactual Stress Testing for Image Classification Models

ArXi:2605.10894v1 Announce Type: new Deep learning models in medical imaging often fail when deployed in new clinical environments due to distribution shifts in graphics, scanner hardware, or acquisition protocols. A central challenge is underspecification, where models with similar validation performance exhibit divergent real-world failure modes.