Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

arXiv:2604.26951v1 Announce Type: cross Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross-architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer.