Nemotron-Labs-Diffusion from NVIDIA

r/LocalLLaMA

Model Overview

Nemotron-Labs-Diffusion is a tri-mode language model that supports both autoregressive (AR) decoding and diffusion-based parallel decoding by simply switching the attention pattern of the same model during inference. The synergy between these two modes enables a third mode, called self-speculation: the same model performs diffusion-based parallel drafting and AR verification with a shared KV cache, achieving high acceptance lengths and decoding efficiency.
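
To make the three modes a bit more concrete, here is a rough toy sketch in plain Python. It is not NVIDIA's implementation. The block-wise mask layout, the greedy accept-until-first-mismatch rule, and every name in it (`causal_mask`, `block_diffusion_mask`, `self_speculative_step`, `draft_fn`, `verify_fn`) are assumptions made up for illustration. The overview only says that the attention pattern is switched and that drafting and verification share one model and one KV cache.

```python
from typing import Callable, List

Tokens = List[int]


def causal_mask(n: int) -> List[List[bool]]:
    """AR mode: position i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_diffusion_mask(n: int, block: int) -> List[List[bool]]:
    """Parallel mode (assumed block-wise layout): position i attends to
    everything up to the end of its own block, so tokens inside one block
    can see each other while earlier blocks stay fully visible."""
    return [[j < (i // block + 1) * block for j in range(n)] for i in range(n)]


def accept_length(draft: Tokens, verified: Tokens) -> int:
    """Count the leading draft tokens that the AR pass agrees with."""
    k = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        k += 1
    return k


def self_speculative_step(
    prefix: Tokens,
    draft_fn: Callable[[Tokens, int], Tokens],
    verify_fn: Callable[[Tokens, Tokens], Tokens],
    block: int,
) -> Tokens:
    """One draft-then-verify step (assumed greedy acceptance rule).

    draft_fn  stands in for the model run with the parallel mask: it
              proposes `block` tokens at once.
    verify_fn stands in for the same model run with the causal mask: for
              each draft slot it returns the token AR decoding would pick.
    """
    draft = draft_fn(prefix, block)
    verified = verify_fn(prefix, draft)
    k = accept_length(draft, verified)
    accepted = draft[:k]
    # On the first mismatch, keep the AR token so every step makes progress.
    if k < len(draft):
        accepted = accepted + [verified[k]]
    return prefix + accepted


# Toy stand-ins for the two passes of the same model (hypothetical values).
def toy_draft(prefix: Tokens, block: int) -> Tokens:
    return [1, 2, 9, 4][:block]


def toy_verify(prefix: Tokens, draft: Tokens) -> Tokens:
    return [1, 2, 3, 4][: len(draft)]


result = self_speculative_step([0], toy_draft, toy_verify, block=4)
print(result)
```

In this toy run the first two drafted tokens match the AR pass, the third does not, so the step keeps two drafted tokens plus the AR correction. The "acceptance length" in the overview corresponds to `k` here: the longer the drafted prefix the AR pass agrees with, the more tokens each verification pass yields.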