Diffusion-State Policy Optimization for Masked Diffusion Language Models

arXiv:2602.06462v3 Announce Type: replace-cross Masked diffusion language models generate text through iterative masked-token filling, but terminal-only rewards on final completions provide coarse credit assignment for the intermediate filling decisions that shape the generation process. We propose Diffusion-State Policy Optimization (DiSPO), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions.