Teaching Old Tokenizers New Words: Efficient Tokenizer Adaptation for Pre-trained Models

arXiv:2512.03989v2 (announce type: replace)

Tokenizer adaptation plays an important role in adapting pre-trained language models to new domains or languages. In this work, we address two complementary aspects of this process: vocabulary extension and pruning. The common approach to extension trains a new tokenizer on domain-specific text and appends the tokens that do not overlap with the existing vocabulary, which often results in many tokens that are unreachable or never used. We propose continued.
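
To make the unreachable-token problem with the common extension approach concrete, here is a minimal sketch using a toy BPE encoder. It is an illustration under stated assumptions, not the method or analysis from the paper: the merge lists, the helper `bpe_encode`, and the definition of "reachable" (a token is reachable if encoding its own string yields exactly that token) are all hypothetical choices made for this example.

```python
def bpe_encode(word, merges):
    """Toy BPE: repeatedly apply the highest-priority (lowest-rank) merge."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Collect (rank, position) for every adjacent pair that has a merge rule.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks
        ]
        if not candidates:
            break
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# Hypothetical merges of the existing (base) tokenizer.
base_merges = [("e", "n"), ("t", "h")]
# Hypothetical merges learned by a new tokenizer trained on domain text.
domain_merges = [("g", "e"), ("ge", "n"), ("gen", "e")]

# Naive extension: append the non-overlapping domain merges and tokens
# after the base ones, so base merges keep their higher priority.
extended_merges = base_merges + [m for m in domain_merges if m not in base_merges]
base_tokens = {a + b for a, b in base_merges}
new_tokens = [a + b for a, b in domain_merges if a + b not in base_tokens]

# The domain tokenizer alone reaches "gene"; the extended one does not,
# because the older merge ("e", "n") fires first and blocks the path.
print(bpe_encode("gene", domain_merges))    # ['gene']
print(bpe_encode("gene", extended_merges))  # ['g', 'en', 'e']

# Assumed definition: a token is unreachable if encoding its own
# string does not produce that single token.
unreachable = [t for t in new_tokens if bpe_encode(t, extended_merges) != [t]]
print(unreachable)                          # ['gen', 'gene']
```

In this toy setup, two of the three appended tokens occupy vocabulary slots (and, in a real model, embedding rows) without ever being produced, which is the kind of waste the abstract attributes to the append-based approach.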