TharuChat: Bootstrapping Large Language Models for a Low-Resource Language via Synthetic Data and Human Validation

arXiv:2603.17220v1 Announce Type: cross The rapid proliferation of Large Language Models (LLMs) has created a profound digital divide, effectively excluding indigenous languages of the Global South from the AI revolution. The Tharu language, an Indo-Aryan vernacular spoken by approximately 1.7M people across the Terai belt of Nepal and India, exemplifies this crisis.