From Curated Data to Scalable Models: Continual Pre-training of Dense and MoE Large Language Models for Tibetan

arXiv:2507.09205v5 Announce Type: replace Large language models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks, yet their performance remains heavily biased toward high-resource languages. Tibetan, despite its cultural significance and large speaker population, is still substantially underrepresented. In this work, we present a comprehensive pipeline for advancing Tibetan language modeling through large-scale data curation and continual pre