Survey-aware Machine Learning: A Guideline for Valid Population Health Inference based on Scoping Review

arXiv:2605.08963v1 Announce Type: cross Machine learning (ML) models trained on complex health surveys such as the National Health and Nutrition Examination Survey (NHANES) often ignore primary sampling units, stratification variables, and sampling weights. This practice violates the independence assumptions of standard evaluation methods. As a result, estimates become biased, uncertainty is underestimated, and fairness assessments fail to reflect population-level disparities.
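
To make the bias claim concrete, the following is a minimal sketch on a hypothetical two-stratum toy survey; it is not taken from the paper and does not use NHANES data. All names and numbers (`pop_size`, `n_sampled`, `n_positive`) are illustrative assumptions. It compares an unweighted prevalence estimate with a weighted one, and uses Kish's approximation for the effective sample size as a rough indication of how unequal weights reduce precision. Clustering within primary sampling units is not modeled here and would typically reduce the effective sample size further.

```python
import numpy as np

# Hypothetical toy design (assumption, not from the paper):
# stratum A is 10% of the population but half of the sample.
pop_size = {"A": 1_000, "B": 9_000}    # population counts per stratum
n_sampled = {"A": 100, "B": 100}       # respondents per stratum
n_positive = {"A": 50, "B": 10}        # respondents with the outcome

y_parts, w_parts = [], []
for s in ("A", "B"):
    y_s = np.zeros(n_sampled[s])
    y_s[: n_positive[s]] = 1.0
    y_parts.append(y_s)
    # Sampling weight = number of people each respondent represents.
    w_parts.append(np.full(n_sampled[s], pop_size[s] / n_sampled[s]))

y = np.concatenate(y_parts)
w = np.concatenate(w_parts)

unweighted = y.mean()                  # treats the sample as i.i.d.
weighted = np.average(y, weights=w)    # weighted prevalence estimate

# Kish's approximate effective sample size under unequal weights.
n_eff = w.sum() ** 2 / (w ** 2).sum()

print(f"unweighted prevalence: {unweighted:.2f}")  # 0.30
print(f"weighted prevalence:   {weighted:.2f}")    # 0.14
print(f"nominal n = {len(y)}, effective n = {n_eff:.0f}")
```

In this toy case the unweighted estimate (0.30) is more than double the population prevalence (0.14), because the oversampled stratum dominates the sample. The effective sample size is also well below the nominal 200, so an interval computed under an independence assumption would be too narrow.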