Defragmenting Language Models: An Interpretability-based Approach for Vocabulary Expansion

arXiv:2604.16656v1 Announce Type: new

All languages are equal; when it comes to tokenization, some are more equal than others. Tokens are the hidden currency that dictates the cost and latency of access to contemporary LLMs. However, many languages written in non-Latin scripts face a poor exchange rate: LLMs need several times as many tokens to encode the same information in these languages as they do in English. Our analysis reveals that this issue, known as 'token over-fragmentation', persists in modern open-weight LLMs.
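
The "exchange rate" described above can be illustrated as a ratio of token counts for the same content in two languages. The following is a minimal sketch, not the paper's method. The names `byte_tokenize` and `fragmentation_ratio` are hypothetical, and splitting text into UTF-8 bytes is only a stand-in for a real LLM tokenizer, chosen because it needs no external library. The example sentences are illustrative and are not data from the paper.

```python
from typing import Callable, List


def byte_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def fragmentation_ratio(
    text: str,
    reference: str,
    tokenize: Callable[[str], List[int]] = byte_tokenize,
) -> float:
    """Token count of `text` divided by token count of `reference`.

    Both strings are assumed to carry the same information, e.g. a
    sentence and its translation. A value above 1 means `text` costs
    more tokens than `reference` under the given tokenizer.
    """
    reference_count = len(tokenize(reference))
    if reference_count == 0:
        raise ValueError("reference text produced no tokens")
    return len(tokenize(text)) / reference_count


# Illustrative parallel pair: an English greeting and a Hindi rendering.
english = "Hello world"
hindi = "नमस्ते दुनिया"

ratio = fragmentation_ratio(hindi, english)
print(f"Hindi/English token ratio under the stand-in tokenizer: {ratio:.2f}")
```

To measure a real model, a callable wrapping that model's tokenizer could be passed as `tokenize` in place of the byte-level stand-in.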