Identifying and typifying demographic unfairness in phoneme-level embeddings of self-supervised speech recognition models

arXiv:2604.22631v1 Announce Type: new Modern automatic speech recognition (ASR) systems have been observed to perform better for certain speaker groups (SGs) than for others, despite recent gains in overall performance. One potential impediment to progress towards fairer ASR is the lack of a nuanced understanding of the types of modeling errors that speech encoder models make, and in particular of how the structure of embeddings differs between high-performance and low-performance SGs.