TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment

arXiv:2605.13429v1 Announce Type: new Tokenization is a foundational step in the text processing of Large Language Models (LLMs). Text must first be tokenized into token IDs, which are then fed to the LLM. Inefficient tokenization results in long token-ID sequences and will slow down the