On the Trainability of Masked Diffusion Language Models via Blockwise Locality

arXiv:2604.24832v1

Masked diffusion language models (MDMs) have recently emerged as a promising alternative to standard autoregressive large language models (AR-LLMs), yet their optimization can be substantially less stable. We study blockwise MDMs and compare them with AR-LLMs on three controlled tasks that stress different aspects of structured generation: in-context linear regression, graph path-finding, and Sudoku solving. We find that standard random-masking MDMs show unreliable training dynamics on graph path-finding, while outperforming AR-LLMs on Sudoku.
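
To make the contrast between random masking and blockwise masking concrete, the following is a minimal sketch and not the paper's implementation. It assumes one common blockwise formulation: blocks before the current one stay visible, the current block is randomly masked, and later blocks are fully hidden. The names `MASK`, `random_mask`, `blockwise_mask`, `block_size`, and `block_idx` are hypothetical and chosen only for illustration.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for the mask token


def random_mask(tokens, ratio, rng):
    """Mask every position independently with probability `ratio`."""
    return [MASK if rng.random() < ratio else tok for tok in tokens]


def blockwise_mask(tokens, block_size, block_idx, ratio, rng):
    """Assumed blockwise scheme: earlier blocks stay clean, the current
    block is randomly masked, and all later blocks are fully masked."""
    start = block_idx * block_size
    end = start + block_size
    out = []
    for pos, tok in enumerate(tokens):
        if pos < start:
            out.append(tok)  # already-generated context
        elif pos < end:
            out.append(MASK if rng.random() < ratio else tok)  # block being denoised
        else:
            out.append(MASK)  # future blocks are hidden
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    seq = list("abcdefgh")
    print(random_mask(seq, 0.5, rng))
    print(blockwise_mask(seq, block_size=4, block_idx=1, ratio=0.5, rng=rng))
```

Under this assumption, the blockwise variant restricts the denoising targets to a local window conditioned on a clean prefix, whereas plain random masking spreads the targets over the entire sequence.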