Dispersion Loss Counteracts Embedding Condensation and Improves Generalization in Small Language Models

arXiv:2602.00217v2

Large language models (LLMs) achieve remarkable performance through ever-increasing parameter counts, but scaling incurs steep computational costs. To better understand LLM scaling, we study representational differences between LLMs and their smaller counterparts, with the goal of replicating the representational qualities of larger models in smaller ones. We observe a geometric phenomenon that we term $\textit{\textbf{embedding condensation}}$, in which token embeddings collapse into a narrow cone-like subspace in some language models.
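One way to make the cone-like geometry concrete is to measure how similar embedding directions are to one another. The sketch below is only an illustration on synthetic vectors, not the measurement used in the paper: the function name `mean_pairwise_cosine`, the choice of mean pairwise cosine similarity as a proxy for condensation, and the toy data are all assumptions introduced here. A value near 0 suggests directions spread across the space, while a value near 1 suggests vectors packed into a narrow cone.

```python
import math
import random


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of vectors.

    Hypothetical proxy for condensation; not taken from the paper.
    """
    # Normalize each vector to unit length so dot products are cosines.
    unit = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        unit.append([x / norm for x in v])

    total, count = 0.0, 0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            total += sum(a * b for a, b in zip(unit[i], unit[j]))
            count += 1
    return total / count


rng = random.Random(0)
dim, n = 16, 50

# Toy "spread" embeddings: isotropic Gaussian directions.
spread = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

# Toy "condensed" embeddings: a large shared offset dominates every vector,
# so all of them point in nearly the same direction (a narrow cone).
offset = [5.0] * dim
condensed = [[o + rng.gauss(0, 1) for o in offset] for _ in range(n)]

spread_score = mean_pairwise_cosine(spread)
condensed_score = mean_pairwise_cosine(condensed)
print(f"spread:    {spread_score:.3f}")
print(f"condensed: {condensed_score:.3f}")
```

On real models, the same function could be applied to a sample of rows from a token embedding matrix or to hidden states at a given layer; the quadratic pair loop is kept for clarity and would need vectorizing for large samples.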