Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP

arXiv:2605.17442v1 Announce Type: new Multilingual NLP often relies on dataset counts from centralized catalogues to characterize which languages are resource-rich or resource-poor. However, these catalogues record only one layer of dataset visibility: what has been registered or institutionally distributed. They do not necessarily reflect which datasets are created, cited, or reused in the research literature. To examine this gap, we combine a catalogue-based baseline with literature-backed evidence of dataset circulation. We