Scaling Properties of Continuous Diffusion Spoken Language Models

arXiv:2604.24416v1 Announce Type: cross Speech-only spoken language models (SLMs) lag behind text and text-speech models in performance, and recent discrete autoregressive (AR) SLMs indicate that matching text models demands significant compute and data. Since discretizing continuous speech for AR modeling creates bottlenecks, we explore whether a continuous diffusion (CD) SLM is viable. To quantify the SLMs' linguistic quality, we