On the Predictive Power of Representation Dispersion in Language Models

arXiv:2506.24106v2 Announce Type: replace We show that a language model's ability to predict text is tightly linked to the breadth of its embedding space: models that spread their contextual representations more widely tend to achieve lower perplexity. Concretely, we find that representation dispersion, the average pairwise cosine distance among hidden vectors, strongly and negatively correlates with perplexity across diverse model families (LLaMA, Qwen, and others) and domains (Wikipedia, news, scientific abstracts).
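
As a rough illustration of the quantity defined above, the following minimal sketch computes the average pairwise cosine distance over a small set of vectors. It is not the authors' implementation: the function names `cosine_distance` and `representation_dispersion` are hypothetical, the toy vectors stand in for real hidden states, and the abstract does not say which layer or which tokens the vectors are taken from.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cos(u, v) for two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def representation_dispersion(vectors: Sequence[Sequence[float]]) -> float:
    """Average cosine distance over all unordered pairs of vectors.

    Hypothetical helper: in practice the inputs would be hidden vectors
    extracted from a language model, which this sketch does not do.
    """
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy vectors (assumed for illustration only).
    clustered = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
    spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    print(representation_dispersion(clustered))
    print(representation_dispersion(spread))
```

Under this definition, identical vectors give a dispersion of 0, mutually orthogonal vectors give 1, and the value can reach 2 for a pair of opposite vectors. Enumerating all pairs costs time quadratic in the number of vectors, so a real measurement over many hidden states would likely use a vectorized or subsampled variant.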