AI RESEARCH
Data Contamination in Neural Hieroglyphic Translation: A Reproducibility Study
arXiv CS.CL
arXiv:2605.07453v1 Announce Type: new
Ancient and endangered languages pose a unique challenge for NLP: their datasets are inherently scarce, difficult to expand, and built from formulaic corpora, which makes data-quality issues especially consequential, yet such issues are rarely audited. To understand what current NMT can realistically achieve for such languages, we investigate hieroglyphic-to-German translation, where a recent study reported 61.5 BLEU using a fine-tuned M2M-100 model. Our reproduction with the released model yields only 37.0 BLEU.
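The abstract as quoted here does not describe how the data was audited. As an illustration only, one common form of contamination check measures how many test source sentences also occur verbatim in the training set. The sketch below assumes the data is available as plain lists of source strings; the function names `normalize` and `contamination_rate` are hypothetical and are not taken from the paper or its released code.

```python
from typing import Iterable, List, Set


def normalize(text: str) -> str:
    """Collapse runs of whitespace so trivially different copies match.

    Case is deliberately preserved: in common hieroglyphic transliteration
    schemes, upper- and lower-case letters can denote different signs.
    """
    return " ".join(text.split())


def contamination_rate(train_sources: Iterable[str],
                       test_sources: List[str]) -> float:
    """Return the fraction of test source sentences also found in training data.

    This only detects exact duplicates (after whitespace normalization);
    near-duplicates and formulaic variants would need fuzzy matching.
    """
    if not test_sources:
        return 0.0
    seen: Set[str] = {normalize(s) for s in train_sources}
    hits = sum(1 for s in test_sources if normalize(s) in seen)
    return hits / len(test_sources)


if __name__ == "__main__":
    # Toy, made-up strings; not real corpus data.
    train = ["Htp di nsw", "anx  wDA snb"]
    test = ["anx wDA snb", "Dd mdw in"]
    print(f"overlap: {contamination_rate(train, test):.2f}")
```

A high overlap would suggest that a reported score partly reflects memorized test items, so scores are often also reported on the non-overlapping subset of the test set.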