AI RESEARCH

Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research

arXiv CS.CL

arXiv:2602.14819v2 Announce Type: replace We present "Testimole-conversational", a massive collection of discussion board messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), renders it an ideal dataset for native Italian Large Language Models' pre-