AI RESEARCH

Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

arXiv CS.LG

arXiv:2605.07022v1 Announce Type: new

Manually curated biomedical repositories -- spanning bioactivity, genomics, and chemistry -- are expensive to maintain, lag behind the primary literature, and discard experimental context, obscuring the nuances needed to assess data correctness and coverage. We show that PubMed itself can be autonomously and cost-effectively turned into structured datasets that are larger, more nuanced, and more accurate than the curated databases they replace. We present three coupled contributions: (1) an LLM-based entity-tagging pipeline, grounded in nine biomedical ontologies, that.