Getting good predictions without data cleaning (Why "Garbage In, Garbage Out" is sometimes a trap)

r/artificial
Machine Learning Data Science AI Research

Full arXiv Preprint: Paper
Simulation GitHub:

Hi r/artificial,

Here's a dirty little secret many of us share: sometimes, downstream AI/ML models perform surprisingly well when you just hand them raw, error-prone tabular data instead of heavily curated feature sets. Despite this, the vast majority of our field remains fiercely loyal to "Garbage In, Garbage Out" (GIGO). While automated ETL pipelines are absolutely essential for structuring data, our workflows are still bottlenecked by endless manual cleaning and aggressive imputation just to curate pristine, error-free tables.