BERnaT: Basque Encoders for Representing Natural Textual Diversity

arXiv:2512.03903v2

Language models depend on massive text corpora that are often filtered for quality, a process that can unintentionally exclude non-standard linguistic varieties, reduce model robustness, and reinforce representational biases. In this paper, we argue that language models should aim to capture the full spectrum of language variation (dialectal, historical, informal, etc.) rather than relying solely on standardized text.