AI RESEARCH

Samasāmayik: A Parallel Dataset for Hindi-Sanskrit Machine Translation

arXiv cs.CL

arXiv:2603.24307v1. We release Samasāmayik, a novel, meticulously curated, large-scale Hindi-Sanskrit corpus comprising 92,196 parallel sentences. Unlike most available Sanskrit data, which focuses on classical-era texts and poetry, this corpus aggregates data from diverse contemporary sources, including spoken tutorials, children's magazines, radio conversations, and instructional materials. We benchmark the dataset by fine-tuning three complementary models (ByT5, NLLB, and IndicTrans-v2) to demonstrate its utility.