AI RESEARCH

Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation

arXiv cs.AI

arXiv:2505.00022v3

Scaling data quantity is essential for large language models (LLMs), yet recent findings show that data quality can significantly boost performance and