DISA: Offline Importance Sampling for Distribution-Matching LLM-RL

ArXi:2605.17295v1 Announce Type: new Modern reasoning agents are increasingly evaluated on their ability to generate multiple valid solution paths, plans, or tool-use traces for a given input. Standard reward-maximizing RL tends to collapse onto the most easily reinforced high-reward mode, whereas distribution-matching RL aims to allocate probability mass across the entire reward-shaped solution set. Achieving this objective requires computing a prompt-dependent partition function over the trajectory space.