Towards Practical Benchmarking of Data Cleaning Techniques: On Generating Authentic Errors via Large Language Models

ArXi:2507.10934v3 Announce Type: replace-cross Data quality remains an important challenge in data-driven systems, as errors in tabular data can severely compromise downstream analytics and machine learning performance. Although numerous error detection algorithms have been proposed, the lack of diverse, real-world error datasets limits comprehensive evaluation. Manual error annotation is both time-consuming and inconsistent, motivating the exploration of synthetic error generation as an alternative. In this work, we.