AI RESEARCH

NPU Design for Diffusion Language Model Inference

arXiv CS.AI

arXiv:2601.20706v2 Announce Type: replace-cross

Diffusion-based LLMs (dLLMs) depart fundamentally from traditional autoregressive (AR) LLM inference: they rely on bidirectional attention, block-wise KV cache refreshing, cross-step reuse, and a sampling phase that is not GEMM-centric. These characteristics make current dLLMs incompatible with most existing NPUs, because their inference patterns, in particular the reduction-heavy, top-$k$-driven sampling stage, demand ISA and memory-hierarchy support beyond what AR accelerators provide.
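To make the "reduction-heavy, top-$k$-driven" sampling stage concrete, here is a minimal Python sketch of one denoising step in which the most confident masked positions are committed. The abstract does not specify the sampling algorithm, so this confidence-based unmasking rule is an assumption chosen for illustration, and the names `MASK`, `softmax_max`, and `unmask_step` are hypothetical. It only shows why such a step is dominated by reductions and selection rather than matrix multiplication.

```python
import math

MASK = -1  # hypothetical sentinel for a position that is still masked


def softmax_max(logits):
    """Return (argmax token id, its softmax probability) for one position.

    Each call performs three reductions over the vocabulary axis.
    """
    m = max(logits)                                # reduction 1: max
    denom = sum(math.exp(x - m) for x in logits)   # reduction 2: sum of exps
    best = max(range(len(logits)), key=logits.__getitem__)  # reduction 3: argmax
    return best, math.exp(logits[best] - m) / denom


def unmask_step(tokens, logits, k):
    """Commit the k most confident masked positions (assumed sampling rule)."""
    candidates = []
    for pos, tok in enumerate(tokens):
        if tok == MASK:
            token_id, conf = softmax_max(logits[pos])
            candidates.append((conf, pos, token_id))
    # Top-k over positions: a selection problem, not a GEMM.
    # Ties are broken in favor of the earlier position.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    out = list(tokens)
    for _conf, pos, token_id in candidates[:k]:
        out[pos] = token_id
    return out


if __name__ == "__main__":
    tokens = [5, MASK, MASK, MASK]
    logits = [
        [0.0, 0.0, 0.0],  # already decoded, ignored
        [2.0, 0.0, 0.0],
        [0.1, 0.0, 0.0],
        [0.0, 5.0, 0.0],
    ]
    print(unmask_step(tokens, logits, k=2))
```

In this toy run the third masked position is the most confident and the first is next, so those two are filled and the middle one stays masked for a later step. With a realistic vocabulary, the per-position max, sum, and argmax plus the cross-position top-$k$ are the kind of work the abstract describes as non-GEMM-centric.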