AI RESEARCH

Data Mixing Can Induce Phase Transitions in Knowledge Acquisition

arXiv CS.AI

arXiv:2505.18091v3 Announce Type: replace-cross Large Language Models (LLMs) are typically trained on data mixtures: most data come from web scrapes, while a small portion is curated from high-quality sources with dense domain-specific knowledge. In this paper, we show that when