Gumbel Distillation for Parallel Text Generation

arXiv:2603.22216v1 Announce Type: cross The slow, sequential nature of autoregressive (AR) language models has driven the adoption of parallel decoding methods. However, these non-AR models often sacrifice generation quality as they struggle to model the complex joint distribution of token sequences. To narrow this performance gap, we