StochasTok: Improving Fine-Grained Subword Understanding in LLMs

ArXi:2506.01687v3 Announce Type: replace Subword-level understanding is integral to numerous tasks, including understanding multi-digit numbers, spelling mistakes, abbreviations, rhyming, and wordplay. Despite this, current large language models (LLMs) still struggle disproportionally with simple subword-level tasks like 'How many r's in strawberry?'. A key factor behind these failures is tokenization, which obscures the fine-grained structure of words.