Pearmut: Human Evaluation of Translation Made Trivial

arXiv:2601.02933v3 Announce Type: replace Human evaluation is the gold standard for multilingual NLP, but in practice it is often skipped and replaced with automatic metrics, because setting it up with existing tools is notoriously complex and slow and carries substantial engineering and operational overhead. We