AI RESEARCH

[P] Structured Prompting for Extremely Low-Resource Languages: 80% → 5% Vocabulary Contamination, No Fine-Tuning

r/MachineLearning

Most low-resource language research assumes you can fine-tune. But what happens when a language has ~2M speakers, no official script standardization, near-zero web presence, and you're working with a frozen model? We ran into this with Tulu, a Dravidian language from coastal Karnataka, India. The core failure mode is consistent across models: prompt in Tulu, get Kannada back. The models aren't hallucinating randomly; they're collapsing to the nearest high-probability neighbor in the.