Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation

ArXi:2604.11424v1 Announce Type: new Speech Language Models (SLMs) exhibit strong semantic understanding, yet their generated speech often sounds flat and fails to convey expressive intent, undermining user engagement. We term this mismatch the semantic understanding-acoustic realization gap. We attribute this gap to two key deficiencies: (1) intent transmission failure, where SLMs fail to provide the stable utterance-level intent needed for expressive delivery; and (2) realization-unaware.