A Family of LLMs Liberated from Static Vocabularies

arXiv:2603.15953v1 Announce Type: cross

Tokenization is a central component of natural language processing in current large language models (LLMs), enabling models to convert raw text into processable units. Although learned tokenizers are widely adopted, they exhibit notable limitations, including large, fixed vocabulary sizes and poor adaptability to new domains or languages. We present a family of models with up to 70B parameters based on the hierarchical autoregressive transformer (HAT) architecture.