Significance-Gain Pair Encoding for LLMs: A Statistical Alternative to Frequency-Based Subword Merging

arXiv:2603.19261v1 Announce Type: cross Subword tokenization is a key design choice for modern language models, including large language models (LLMs), with byte- and character-level BPE serving as a widely used baseline. Standard BPE selects merges by raw pair frequency, which favors compression but can conflate true adjacency cohesion with pairs that are frequent due to high marginal counts. This paper