Vocab Diet: Reshaping the Vocabulary of LLMs via Vector Arithmetic

arXiv:2510.17001v2 Announce Type: replace

Large language models (LLMs) often encode word-form variation (e.g., walk vs. walked) as linear directions in the embedding space. However, standard tokenization algorithms treat such variants as distinct tokens, each with its own vocabulary entry. This quickly fills the size-capped token vocabulary with surface-form variants (e.g., walk, walking, Walk) at the expense of diversity and multilingual coverage.
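
To make the "linear direction" idea concrete, here is a minimal toy sketch in Python. It is an illustration under stated assumptions, not the paper's method: the embeddings are random vectors rather than weights from any real LLM, and the names `past_offset`, `estimate_offset`, and `nearest` are hypothetical helpers introduced only for this example. It assumes that each inflected form sits at its base form plus a shared offset (plus small noise), estimates that offset from two known pairs, and then composes an embedding for a third word by vector addition.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Toy base-form embeddings (random; NOT taken from any real model).
base = {w: rng.normal(size=dim) for w in ["walk", "jump", "talk"]}

# Assumption for illustration: one shared direction encodes "past tense".
past_offset = rng.normal(size=dim)

# Each inflected form = base form + shared offset + small noise.
vocab = dict(base)
for word, vec in base.items():
    vocab[word + "ed"] = vec + past_offset + 0.05 * rng.normal(size=dim)


def estimate_offset(pairs, emb):
    """Average the difference vectors over (base, inflected) pairs."""
    return np.mean([emb[inflected] - emb[b] for b, inflected in pairs], axis=0)


def nearest(vec, emb):
    """Return the vocabulary word whose embedding is most cosine-similar to vec."""
    words = list(emb)
    mat = np.stack([emb[w] for w in words])
    sims = mat @ vec / (np.linalg.norm(mat, axis=1) * np.linalg.norm(vec))
    return words[int(np.argmax(sims))]


# Estimate the direction from two pairs, then apply it to a held-out word.
offset = estimate_offset([("walk", "walked"), ("jump", "jumped")], vocab)
composed = vocab["talk"] + offset
result = nearest(composed, vocab)
print(result)
```

In this toy setup the composed vector lands next to the stored entry for the inflected form, which suggests why a dedicated vocabulary entry for every surface variant can be redundant when such directions exist. Whether and how well this holds in a real model is an empirical question that the toy example does not address.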