SyncSpeech: Efficient and Low-Latency Text-to-Speech based on Temporal Masked Transformer

arXiv:2502.11094v2

Current text-to-speech (TTS) models face a persistent limitation: autoregressive (AR) models suffer from low generation efficiency, while modern non-autoregressive (NAR) models incur high latency due to their unordered temporal nature. To bridge this divide, we