Probing neural audio codecs for distinctions among English nuclear tunes

arXiv:2603.14035v1 Announce Type: cross State-of-the-art spoken dialogue models (Défossez 2024; Schalkwyk 2025) use neural audio codecs to "tokenize" audio signals into a lower-frequency stream of vector-valued latent representations, each quantized using a hierarchy of vector codebooks. A transformer layer allows these representations to reflect some time- and context-dependent patterns. We train probes on labeled audio data from Cole to test whether the pitch trajectories that characterize English phrase-final (nuclear) intonational tunes are among these patterns.
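
As a rough illustration of what quantizing a latent "using a hierarchy of vector codebooks" can look like, the sketch below implements residual vector quantization in plain Python. It assumes the hierarchy is residual (each codebook encodes what the previous level left over), which is common in neural audio codecs but is not stated in the abstract. The function names, codebook sizes, and random codebooks are hypothetical and are not taken from any codec or from the paper.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(codebooks, vec):
    """Greedily quantize vec: each level encodes the residual left by the previous level."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(codebooks, indices):
    """Reconstruct a vector by summing the selected entry from every level."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, i in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


# Hypothetical toy setup: 3 levels, 8 entries per level, 4-dimensional latents.
# Later levels use smaller-scale entries so they can refine earlier ones.
rng = random.Random(0)
DIM, LEVELS, ENTRIES = 4, 3, 8
codebooks = [
    [[rng.gauss(0, 1.0 / 2 ** level) for _ in range(DIM)] for _ in range(ENTRIES)]
    for level in range(LEVELS)
]

frame = [rng.gauss(0, 1) for _ in range(DIM)]   # stand-in for one latent frame
tokens = rvq_encode(codebooks, frame)           # one discrete token per level
reconstruction = rvq_decode(codebooks, tokens)  # approximate latent
```

In this sketch each latent frame becomes a short tuple of integer tokens, one per codebook level. A probe of the kind the abstract describes would be trained on representations derived from such frames; the probe itself is not shown here.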