From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training

arXiv:2509.20072v4 Announce Type: replace

Recent advances in large language models (LLMs) have attracted significant interest in extending their capabilities to multimodal scenarios, particularly speech-to-speech conversational systems. However, existing multimodal models that handle interleaved audio and text rely on autoregressive (AR) methods, overlooking that text depends on target-target relations whereas audio depends mainly on source-target relations.