
How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them

arXiv CS.CL

arXiv:2604.17105v1

Tokenization is the first step in every language model (LM), yet it never takes the sounds of words into account. We investigate how tokenization influences text-only LMs' ability to represent phonological knowledge. Through a series of probing experiments, we show that subword-based tokenization systematically weakens the encoding of both local (e.g., rhyme) and global (e.g., syllabification) phonological features. To quantify this effect, we