Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation
arXiv CS.AI
arXiv:2505.00022v3 Announce Type: replace-cross
Scaling data quantity is essential for large language models (LLMs), yet recent findings show that data quality can significantly boost performance and