WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling

arXiv:2605.06407v1 Announce Type: cross Integrating speech understanding and generation is a pivotal step toward building unified speech models. However, the two tasks currently rely on different representations, which poses significant compatibility challenges. Typically, semantics-oriented features are learned through self-supervised learning (SSL), whereas acoustics-oriented features are learned through reconstruction. Such fragmented representations hinder the realization of truly unified speech systems.