AI RESEARCH

A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method

arXiv CS.AI

arXiv:2602.02320v3 Announce Type: replace-cross

Molecular function is largely determined by structure. Accurately aligning molecular structure with natural language is therefore essential for enabling large language models (LLMs) to reason about downstream chemical tasks. However, the substantial cost of human annotation makes it infeasible to construct large-scale, high-quality datasets of structure-grounded descriptions. In this work, we propose a fully automated annotation framework that generates, at scale, precise molecular descriptions preserving complete structural detail.