AI RESEARCH

Oral to Web: Digitizing 'Zero Resource' Languages of Bangladesh

arXiv cs.CL

arXiv:2603.05272v2. We present the Multilingual Cloud Corpus, the first national-scale, parallel, multimodal linguistic dataset of Bangladesh's ethnic and indigenous languages. Despite being home to approximately 40 minority languages spanning four language families, Bangladesh has lacked a systematic, cross-family digital corpus for these predominantly oral, computationally "zero resource" varieties, 14 of which are classified as endangered.