CIDR: A Large-Scale Industrial Source Code Dataset for Software Engineering Research

arXiv cs.AI

arXiv:2605.12153v1 Announce Type: cross

We present the Curated Industrial Developer Repository (CIDR), a large-scale dataset of real-world software repositories collected through direct collaboration with 12 industrial partner organizations. The dataset comprises 2,440 repositories spanning 138 programming languages and totalling 373M lines of code, accompanied by structured per-repository metadata. Unlike existing code corpora derived from public open-source platforms, CIDR consists exclusively of ...