From Where Words Come: Efficient Regularization of Code Tokenizers Through Source Attribution

arXiv:2604.14053v1 Announce Type: new The efficiency and safety of Large Language Models (LLMs) depend, among other factors, on the quality of tokenization. A good tokenizer not only improves inference speed and language understanding but also provides an extra defense against jailbreak attacks and lowers the risk of hallucinations. In this work, we investigate the efficiency of code tokenization, in particular from the perspective of data source diversity.