Accelerating Large-Scale Cheminformatics Using a Byte-Offset Indexing Architecture for Terabyte-Scale Data Integration

arXiv:2601.18921v2

The integration of large-scale chemical databases is a critical bottleneck in modern cheminformatics research, particularly for machine learning applications that require high-quality datasets validated against multiple sources. This paper describes the integration of three major public chemical repositories, PubChem (176M compounds), ChEMBL, and eMolecules, to construct a curated dataset for molecular property prediction.
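The title names a byte-offset indexing architecture, which this excerpt does not describe in detail. As a minimal sketch of the general idea only, and not the authors' implementation, the Python below scans a file once, records the byte offset at which each record starts, and then retrieves a record with a single seek instead of a full rescan. The tab-separated ID/SMILES record layout and the helper names `build_offset_index`, `fetch`, and `first_field` are assumptions made for illustration.

```python
import io


def build_offset_index(fh, key_of):
    """Scan a binary file once, mapping each record key to its starting byte offset."""
    index = {}
    while True:
        offset = fh.tell()       # byte position where the next record begins
        line = fh.readline()
        if not line:             # end of file
            break
        index[key_of(line)] = offset
    return index


def fetch(fh, index, key):
    """Jump straight to a record by its stored offset and read that one line."""
    fh.seek(index[key])
    return fh.readline()


def first_field(line):
    """Hypothetical key extractor: the identifier is the first tab-separated field."""
    return line.split(b"\t", 1)[0].decode()


# Tiny in-memory stand-in for a large flat file (assumed layout: ID <tab> SMILES).
data = io.BytesIO(b"CID1\tCCO\nCID2\tc1ccccc1\nCID3\tCC(=O)O\n")
index = build_offset_index(data, first_field)
record = fetch(data, index, "CID2")
```

The file must be opened in binary mode so that `tell()` and `seek()` operate on true byte positions. For a terabyte-scale source, the in-memory dictionary would likely need to be replaced by an on-disk structure, which this sketch does not attempt.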