AI RESEARCH

Language corpora for the Dutch medical domain

arXiv CS.CL

arXiv:2604.25374v1 Announce Type: new \textbf{Background:} Dutch medical corpora are scarce, which limits NLP development. \\ \textbf{Methods:} We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources. \\ \textbf{Results:} The resulting corpus comprises approximately 35B tokens in about 100M documents spanning the medical domain and is freely available on Hugging Face. \\ \textbf{