AI RESEARCH

SEDD: Scalable and Efficient Dataset Deduplication with GPUs

arXiv cs.CL

arXiv:2501.01046v4 Announce Type: replace Dataset deduplication is widely recognized as a crucial preprocessing step that enhances data quality and improves the performance of large language models. A commonly used method for this process is the MinHash Locality-Sensitive Hashing (LSH) algorithm. Recently, GPU-accelerated frameworks such as NVIDIA NeMo Curator have been