SciLaD: A Large-Scale, Transparent, Reproducible Dataset for Natural Scientific Language Processing

arXiv:2512.11192v2 Announce Type: replace SciLaD is a novel, large-scale dataset of scientific language constructed entirely using open-source frameworks and publicly available data sources. It comprises a curated English split containing over 10M scientific publications and a multilingual, unfiltered TEI XML split including more than 35M publications. We also publish the extensible pipeline for generating SciLaD. The dataset construction and processing workflow demonstrates how open-source tools can enable large-scale scientific data curation while maintaining high data quality.