AI RESEARCH
Sampling from Your Language Model One Byte at a Time
arXiv CS.LG
arXiv:2506.14123v3 Announce Type: replace-cross Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can