Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains

ArXi:2604.17393v1 Announce Type: new Automatic evaluation metrics are central to the development of machine translation systems, yet their robustness under domain shift remains unclear. Most metrics are developed on the