AI RESEARCH

Separate Before You Compress: The WWHO Tokenization Architecture

arXiv CS.CL

arXiv:2603.25309v1 Announce Type: new Current Large Language Models (LLMs) mostly use tokenizers based on Byte Pair Encoding (BPE), which are very effective for simply structured Latin scripts such as English. However, standard BPE tokenizers struggle with structurally complex Abugida scripts. The problem is that these tokenizers break complex conjuncts, which are multi-codepoint grapheme clusters, into meaningless sub-character units.
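
To make the conjunct problem concrete, the sketch below inspects a single Devanagari conjunct chosen here purely as an illustration (the abstract does not name a specific script or example). It is not the WWHO method; it only shows, under that assumption, that one visible glyph spans several code points and bytes, so a tokenizer that starts from bytes can cut it into pieces that are not valid text on their own. The helper names `describe` and `decodable` are hypothetical.

```python
import unicodedata

# Illustrative example (an assumption, not taken from the paper):
# the Devanagari conjunct "ksha" = KA + VIRAMA + SSA, rendered as one glyph.
CONJUNCT = "\u0915\u094d\u0937"


def describe(text):
    """Return (code point names, UTF-8 byte values) for text."""
    names = [unicodedata.name(ch) for ch in text]
    byte_values = list(text.encode("utf-8"))
    return names, byte_values


def decodable(byte_values):
    """Return True if the byte sequence is valid UTF-8 on its own."""
    try:
        bytes(byte_values).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


names, byte_values = describe(CONJUNCT)
print(names)                        # three code points behind one glyph
print(len(byte_values))             # number of UTF-8 bytes a byte-level BPE starts from
print(decodable(byte_values[:1]))   # a lone leading byte is a sub-character unit
```

A byte-level vocabulary begins with each of those bytes as a separate symbol, and merges learned mostly from Latin-script text are not guaranteed to reassemble them along grapheme-cluster boundaries.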