Medical AI gets 66% worse when you use automated labels for training, and the benchmark hides it! [R][P]

A recent work on fairness in medical segmentation of breast cancer tumors found that segmentation models work way worse for younger patients. The common explanation is that higher breast density means harder cases. But that is not it. The bias is qualitative -- younger patients have tumors that are larger, more variable, and fundamentally harder to segment. Training on automated labels may amplify the bias in your model by 40%. But the benchmark does not show it because of the 'biased ruler' effect: when the labels used to measure performance are themselves biased, they can mask the true performance.
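To make the 'biased ruler' effect concrete, here is a minimal toy sketch (not the setup from the work itself). Everything in it is an assumption for illustration: the disk-shaped "tumors", the radii, and the idea that the automated labels under-segment only the younger group. It only shows how scoring against biased labels can hide a gap that scoring against the true masks reveals.

```python
import numpy as np

def dice(pred, target):
    """Dice overlap between two boolean masks."""
    inter = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    return 2.0 * inter / denom if denom else 1.0

def disk(radius, size=64):
    """Toy 'tumor': a filled disk centred in a size x size image."""
    yy, xx = np.mgrid[:size, :size]
    c = size // 2
    return (yy - c) ** 2 + (xx - c) ** 2 <= radius ** 2

# Hypothetical radii, chosen only for illustration.
# Older group: automated label matches the true mask, model learns it well.
# Younger group: automated label under-segments, model mimics the label.
cases = {
    "older":   {"true": disk(10), "auto": disk(10), "pred": disk(10)},
    "younger": {"true": disk(20), "auto": disk(14), "pred": disk(15)},
}

# "Benchmark" score: measured with the (biased) automated labels.
measured = {g: dice(c["pred"], c["auto"]) for g, c in cases.items()}
# Actual score: measured against the true masks.
actual = {g: dice(c["pred"], c["true"]) for g, c in cases.items()}

measured_gap = measured["older"] - measured["younger"]
actual_gap = actual["older"] - actual["younger"]

print(f"gap measured with automated labels: {measured_gap:.2f}")
print(f"gap measured with true masks:       {actual_gap:.2f}")
```

In this toy case the gap between groups looks small when the automated labels are the ruler, and much larger against the true masks, because the model and the ruler share the same error.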