Self-Supervised Speech Models Encode Phonetic Context via Position-dependent Orthogonal Subspaces

arXiv:2603.12642v1 Announce Type: cross Transformer-based self-supervised speech models (S3Ms) are often described as contextualized, yet what this entails remains unclear. Here, we focus on how a single frame-level S3M representation can encode phones and their surrounding context. Prior work has shown that S3Ms represent phones compositionally; for example, phonological vectors such as voicing, bilabiality, and nasality vectors are superposed in the S3M representation of [m