A Paradigm for Interpreting Metrics and Identifying Critical Errors in Automatic Speech Recognition

arXiv:2605.03671v1 Announce Type: new The most commonly used metrics for evaluating automatic speech transcriptions, namely Word Error Rate (WER) and Character Error Rate (CER), have been heavily criticized for their poor correlation with human perception and their inability to take linguistic and semantic information into account. While metric-based embeddings, seeking to approximate human perception, have been proposed, their scores remain difficult to interpret, unlike WER and