Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU
arXiv CS.AI
arXiv:2604.15464v1 Announce Type: cross
Large Language Model (LLM) deployment is increasingly shifting to cost-efficient accelerators such as Google's Tensor Processing Units (TPUs), with priority given to both performance and total cost of ownership (TCO). However, existing LLM inference kernels and serving systems remain largely GPU-centric, and there is no well-established approach for efficiently mapping LLM workloads onto TPU architectures, particularly under the dynamic and ragged execution patterns common in modern serving.