I Can't Believe TTA Is Not Better: When Test-Time Augmentation Hurts Medical Image Classification

arXiv:2604.09697v1 Announce Type: cross Test-time augmentation (TTA)--aggregating predictions over multiple augmented copies of a test input--is widely assumed to improve classification accuracy, particularly in medical imaging, where it is routinely deployed in production systems and competition solutions. We present a systematic empirical study challenging this assumption across three MedMNIST v2 benchmarks and four architectures spanning three orders of magnitude in parameter count (21K to 11M).
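As a rough illustration of the aggregation step described above, the following is a minimal sketch of TTA by probability averaging. It is not the paper's implementation: the function names, the flip-based augmentation set, and the toy two-class "model" are all assumptions chosen only to make the example self-contained.

```python
import numpy as np


def tta_predict(predict_proba, image, augmentations):
    """Average class probabilities over augmented copies of one image.

    predict_proba: callable mapping an image array to a 1-D probability vector.
    augmentations: list of callables, each mapping an image to an augmented copy.
    """
    probs = [predict_proba(aug(image)) for aug in augmentations]
    return np.mean(probs, axis=0)


# Hypothetical augmentation set: identity, horizontal flip, vertical flip.
AUGMENTATIONS = [
    lambda x: x,
    lambda x: np.flip(x, axis=1),
    lambda x: np.flip(x, axis=0),
]


def toy_predict_proba(image):
    """Stand-in classifier (assumption): responds to brightness of the left half."""
    left_mean = image[:, : image.shape[1] // 2].mean()
    p1 = 1.0 / (1.0 + np.exp(-10.0 * (left_mean - 0.5)))
    return np.array([1.0 - p1, p1])


# Toy 4x4 image: bright left half, dark right half.
image = np.zeros((4, 4))
image[:, :2] = 1.0

single = toy_predict_proba(image)
averaged = tta_predict(toy_predict_proba, image, AUGMENTATIONS)
print("single view:", single)
print("TTA average:", averaged)
```

In this toy setup the stand-in model is sensitive to left/right orientation, so the horizontally flipped copy pulls the averaged probability away from the single-view prediction. That is only a hand-built example of how averaging over augmentations can dilute a prediction; it says nothing about the magnitudes reported in the study.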