How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP
arXiv CS.CL
arXiv:2411.05527v3. Wikipedia's perceived high quality and broad language coverage have established it as a fundamental resource in NLP. In recent years, however, these assumptions of high quality have come under scrutiny in low-resource and multilingual contexts. In this study, we subject the entirety of non-English Wikipedia to a data filtering procedure typically reserved for noisy web text, a process that removes a large percentage of the collection's data.