Capacity-Aware Mixture Law Enables Efficient LLM Data Optimization

arXiv:2603.08022v1 Announce Type: new A data mixture refers to how different data sources are combined to train large language models, and selecting an effective mixture is crucial for optimal downstream performance. Existing methods either conduct costly searches directly on the target model or rely on mixture scaling laws that fail to extrapolate well to large model sizes. We address these limitations by