AI RESEARCH

Tokenization Tradeoffs in Structured EHR Foundation Models

arXiv cs.LG

arXiv:2603.15644v1 Announce Type: new Foundation models for structured electronic health records (EHRs) are pretrained on longitudinal sequences of timestamped clinical events to learn adaptable patient representations. Tokenization -- how these timelines are converted into discrete model inputs -- determines what information is preserved, how efficiently it is encoded, and which relationships must be learned versus precomputed. Yet the impact of tokenization design choices on downstream performance and computational efficiency remains largely unexplored.
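
To make the idea of a tokenization design choice concrete, the following is a minimal sketch and not the paper's method. It converts a toy patient timeline into discrete tokens in two ways: a joint scheme that fuses a code and its binned value into one token, and a factorized scheme that emits them as separate tokens. The event codes, bin edges, time-gap buckets, and function names are all hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Hypothetical toy timeline: (timestamp, event code, optional numeric value).
timeline = [
    (datetime(2024, 1, 1, 8, 0), "DX:E11.9", None),
    (datetime(2024, 1, 1, 9, 30), "LAB:glucose", 182.0),
    (datetime(2024, 1, 4, 10, 0), "RX:metformin", None),
]


def value_bin(value, edges=(100.0, 140.0, 200.0)):
    """Return how many bin edges the value meets or exceeds (edges are illustrative)."""
    return sum(value >= edge for edge in edges)


def time_gap_token(delta_hours):
    """Bucket the time elapsed since the previous event into a discrete token."""
    if delta_hours < 1:
        return "GAP:<1h"
    if delta_hours < 24:
        return "GAP:<1d"
    return "GAP:>=1d"


def tokenize(events, joint=True):
    """Convert a time-ordered event list into a flat token sequence.

    joint=True  -> one token per valued event, e.g. "LAB:glucose|BIN2"
    joint=False -> two tokens per valued event, e.g. "LAB:glucose", "BIN2"
    """
    tokens = []
    previous_ts = None
    for ts, code, value in events:
        if previous_ts is not None:
            hours = (ts - previous_ts).total_seconds() / 3600
            tokens.append(time_gap_token(hours))
        previous_ts = ts
        if value is None:
            tokens.append(code)
        elif joint:
            tokens.append(f"{code}|BIN{value_bin(value)}")
        else:
            tokens.extend([code, f"BIN{value_bin(value)}"])
    return tokens


joint_tokens = tokenize(timeline, joint=True)
factorized_tokens = tokenize(timeline, joint=False)
print(joint_tokens)
print(factorized_tokens)
```

In this sketch the joint scheme yields a shorter sequence but needs a vocabulary entry for every code and bin combination, while the factorized scheme keeps the vocabulary small at the cost of longer sequences and leaves the model to learn the association between a code and its value. This is one illustration of the "learned versus precomputed" distinction the abstract mentions.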