AI RESEARCH
Language corpora for the Dutch medical domain
arXiv CS.CL
arXiv:2604.25374v1 Announce Type: new \textbf{Background:} Dutch medical corpora are scarce, limiting NLP development. \\ \textbf{Methods:} We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources. \\ \textbf{Results:} The resulting corpus comprises $\pm$ 35B tokens across the medical domain in about 100M documents, freely available on Hugging Face. \\ \textbf{