AI RESEARCH

BhashaSutra: A Task-Centric Unified Survey of Indian NLP Datasets, Corpora, and Resources

arXiv cs.CL

arXiv:2604.18423v1

India's linguistic landscape, spanning 22 scheduled languages and hundreds of marginalized dialects, has driven rapid growth in NLP datasets, benchmarks, and pretrained models. However, no dedicated survey consolidates resources developed specifically for Indian languages. Existing reviews either focus on a few high-resource languages or subsume Indian languages within broader multilingual settings, limiting coverage of low-resource and culturally diverse varieties.