Beyond One Output: Visualizing and Comparing Distributions of Language Model Generations

arXiv:2604.18724v1 Announce Type: new

Users typically interact with and evaluate language models via single outputs, but each output is just one sample from a broad distribution of possible completions. This mode of interaction hides distributional structure such as modes, uncommon edge cases, and sensitivity to small prompt changes, leading users to over-generalize from anecdotes when iterating on prompts for open-ended tasks.